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Efficiency of delayed acceptance random walk 
Metropolis algorithms 

Chris Sherlock , Alexandre H. Thiery and Andrew Golightly 

Abstract: Delayed-acceptance Metropolis-Hastings (DA-MH) and delayed-acceptance pseudo-marginal 
Metropolis-Hastings (DAPsMMH) algorithms can be applied when it is computationally expensive to 
calculate the true posterior or an unbiased stochastic approximation thereof, but a computationally 
cheap deterministic approximation is available. An initial accept-reject stage uses the cheap approxima¬ 
tion for computing the Metropolis-Hastings ratio; proposals which are accepted at this stage are then 
subjected to a further accept-reject step which corrects for the error in the approximation. Since the 
expensive posterior, or the approximation thereof, is only evaluated for proposals which are accepted 
at the first stage, the cost of the algorithm is reduced. 

We focus on the random walk Metropolis (RWM) and consider the DAPsMRWM, of which the 
DARWM is a special case. We provide a framework for incorporating relatively general deterministic 
approximations into the theoretical analysis of high-dimensional targets. Then, justified by a limiting 
diffusion argument, we develop theoretical expressions for limiting efficiency and acceptance rates 
in high dimension. The results provide insight into the effect of the accuracy of the deterministic 
approximation, the scale of the RWM jump and the nature of the stochastic approximation on the 
efficiency of the delayed acceptance algorithm. The predicted properties are verified against simulation 
studies, all of which are strictly outside of the domain of validity of our limit results. The theory also 
informs a practical strategy for algorithm tuning. 

Keywords and phrases: Markov Chain Monte-Carlo, Delayed acceptance, Pseudo-marginal MCMC, 
Particle methods, Diffusion limit. 
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1. Introduction 

The Metropolis-Hastings (MH) algorithm is very widely used to approximately compute expectations with 
respect to complicated high-dimensional posterior distributions [GRS96, BGJM11]. The algorithm requires 
that it be possible to evaluate point-wise the posterior density n up to a fixed but arbitrary constant of 
proportionality. 

The use of a surrogate model to accelerate Bayesian computations has a long history. A well-established 
technique for handling computational expensive models, proposed in [SWMW89], is to construct a cheap 
surrogate model using Gaussian process regression. Similar ideas have been used for speeding up MCMC 
[Ras03, FNL11], 

The delayed-acceptance Metropolis-Hastings (DAMH) algorithm [LiuOl, HLB02, OL04, CF05, MFS08, 
HRM + 11, CFOll], also called two-stage MCMC in the engineering literature, assumes that the exact pos¬ 
terior 7T is available up to a constant of integration, but is computationally expensive to evaluate. This 
framework is particularly relevant to the Bayesian approach to inverse problems [KS06, StulO] where point 
estimations of the posterior density typically involve numerically solving sets of partial differential equations. 
A fast approximation, 7r a , is therefore employed as a first “screening” stage, with proposals which are rejected 
at the screening stage simply discarded. The correct posterior, 7r, is only evaluated for proposals which pass 
the screening stage; a second Metropolis-Hastings accept-reject step, which corrects for the error in the fast 
approximation, is then calculated so that the desired true posterior is obtained as the limiting distribution of 
the Markov chain. The DAMH algorithm thus provides, among other things, a principled way of leveraging 
deterministic approximations [LWG10, BBG + 11, FWA+11, PSSW, GJ14] to the posterior distribution in 
inverse problem modeling. 

The pseudo marginal Metropolis-Hastings (PsMMH) algorithm [Bea03, AR09] allows Bayesian inference 
when only an unbiased stochastic estimate of the target density, possibly up to an unknown normalisation 
constant, is available. The particle marginal Metropolis-Hastings (PMMH) algorithm [ADH10], a special 
instance of the PsMMH algorithm when the unbiased estimate are obtained by using a particle filter, is a 
popular method for estimating parameters in hidden Markov models [GW11, KdV12]. When random walk 
proposals are used, this leads to the pseudo marginal random walk Metropolis (PsMRWM) algorithm; this 
algorithm has the advantage of not requiring further information about the target, such as the local gradient 
or Hessian. If it is not possible to evaluate the posterior then it is usually also impossible to evaluate these 
quantities and they are generally more computationally expensive to approximate than the target itself 
[PDS11], 

It has been shown [AR09, AV15] that the mixing efficiency of a pseudo-marginal algorithm increases with 
decreasing variability in the stochastic approximation. However, decreasing the variability of the stochastic 
estimates of the target density typically comes at a computational price; this leads to a trade-off between 
mixing efficiency and computational expense and suggests that, for a given algorithm and target, there might 
be an optimal value for the stochasticity of the unbiased estimate, a tuning parameter often easily controlled. 
For example, for the PMMH algorithm that uses a particle filter for constructing the unbiased estimate to 
the target distribution, this trade-off translates into choosing an optimal number of particles. The existing 
literature on this topic is reviewed in Section 2.2. 
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The computational expense involved in creating each unbiased stochastic estimate in the PsMRWM (and, 
more generally the PsMMH) algorithm suggests that an initial accept-reject stage using a computationally 
cheap, deterministic, approximation [vKOl, BC14] to the posterior might be beneficial. This motivates [Smill, 
GHS14] the delayed-acceptance PsMMH algorithm (DAPsMMH). In this article, we focus on the case where 
the algorithm uses random walk proposal, that is, the delayed-acceptance pseudo marginal random walk 
Metropolis (DAPsRWM) algorithm. 

1.1. Contributions and organisation of the paper 

To the best of our knowledge, a systematic theoretical analysis of delayed-acceptance type MCMC algorithms 
is lacking in the literature. In this article we examine the efficiency of the DAPsMRWM for high-dimensional 
problems as a function of the scaling of the random walk proposals and of the distribution and computational 
cost of the unbiased estimator of the posterior. This theory also applies to the DARWM, since it is a special 
case of the DAPsMRWM. 

Our innovation is to assume that the error in the cheap deterministic approximation is a realisation of a 
random function. Subject to assumptions about the form of this function, of the target and of the stochastic 
approximation, a diffusion limit is obtained through an homogenization argument. Despite the flexibility 
inherent in our specification of the deterministic approximation, the form of the diffusion depends on the 
random function through just two key scalar properties. The volatility coefficient of the limiting Langevin 
diffusion, which is a rescaled version of the expected squared jump distance (ESJD) [RGG97], characterizes 
the asymptotic mixing properties of the DAPsMRWM [RR14a]. 

We then focus on a specific ‘standard’ asymptotic regime which occurs, for instance, when the unbiased 
stochastic estimates are obtained through a particle filter. We consider the overall efficiency taking compu¬ 
tational time into account and leverage the mixing results to investigate (for all algorithms) its relationship 
with the random-walk scaling, the accuracy of the deterministic approximation and (for pseudo-marginal al¬ 
gorithms) the variability of the stochastic estimator. The analysis sheds light on how computationally cheap 
the deterministic approximation needs to be to make its use worthwhile and on the relative importance of it 
matching the ‘location’ and curvature of the target. The predictions also provide a strategy for tuning the 
algorithm by finding the optimal scaling and (for DAPsMRWM) the optimal number of particles. 

Three simulation studies verify different aspects of the theory and theoretical predictions. A pivotal result 
on the relationship between changes in the posterior and changes in the deterministic approximation is verified 
against a toy Bayesian inverse problem where the error in the deterministic approximation may be evaluated 
pointwise quickly and exactly; the predicted influences of the two scalar properties of the deterministic 
approximation, the RWM scaling and the nature of the stochastic approximation on the acceptance rates 
and overall efficiency are verified against a simple product example where the two scalar quantities may be 
varied and calculated exactly; finally, the predicted effects of the RWM scaling and the stochastic error on 
efficiency, as well as the tuning strategy itself are verified for pseudo-marginal inference on the parameters 
governing a Markov jump process. 

The article proceeds as follows. Section 2 builds up to a description of the DAPsMRWM algorithm 
through its constituent algorithms and provides a brief review of the literature on the efficiency of pseudo¬ 
marginal algorithms. Section 3 describes the high-dimensional asymptotic regime studied in this article, 
sets up the models for the two approximations to the posterior and states the assumptions that are made 
on the posterior itself. In Section 4, we develop an asymptotic analysis of the DAPsMRWM; we formally 
introduce the expected squared jump distance and obtain asymptotic properties. A diffusion limit that gives 
theoretical justification for the optimization study presented in the subsequent section is then established. 
The asymptotic result are leveraged in Section 5 where we discuss the tuning of the DAPsMRWM algorithm. 
The proofs and technical results are gathered in Section 6 and Appendix A. Section 7 provides practical 
advice and ratifies this and other aspects of our theory against simulation studies. The article concludes 
with a discussion. 
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2. The delayed-acceptance Pseudo-Marginal RWM 


2.1. The algorithm 


Consider a statespace X C ]R d and let n be a distribution on X; the density of the distribution n with 
respect to the Lebesgue measure will also be referred to as n. The Metropolis-Hastings updating scheme 
provides a very general class of algorithms for obtaining an approximate dependent sample from 7r by 
constructing a Markov chain with 7r as its limiting distribution. Given the current value x £ X, a new 
value x* is proposed from a pre-specified Lebesgue density <7 (x, x*) and is then accepted with probability 
ot\ (x;x*) = min{l, [7r(x*) q (x*, x)]/[7r(x) <7 (x, x*)]}. If the proposed value is accepted it becomes the next 
current value otherwise the current value is left unchanged. 

The particle marginal Metropolis-Hastings (PMMH) algorithm [ADH10] is a special case of the Pseudo 
Marginal Metropolis-Hastings (PsMMH) algorithm [Bea03, AR09]. These algorithms presume that it is either 
not possible or is computationally infeasible to evaluate the true posterior density, even up to a multiplicative 
constant, but that an unbiased estimate 7r(x*) of 7r(x*) is available. Following [PdSSGK12, Shel3], we write 
the unbiased estimate of 7r(x*) as 7r(x*) = n(pc*)e w where W* = log[7r(x*)/7r(x*)] is a random variable 
whose distribution is governed by the randomness occurring when producing the stochastic approximation 
to 7r(x*); we sometimes write 7r(x*; w*) instead of 7r(x*) e w in order to stress the value of w*. The random 
variable W* is typically intractable and is only introduced in this article as a convenient way to analyse the 
PsMRWM algorithm. Let | x*) be the (conditional) density of W*; the unbiasedness of the estimate 

7T (x*) yields that 

E [exp (W*)\ x*] = f exp (w*) ttw*(w* | x*) dw* = 1 (2.1) 

J R 

for any x* £ X. The PsMMH algorithm creates a Markov chain which has a stationary distribution with 
density 

7r(x, w) oc 7t(x) 7 Tw(w | x) where itw{w | x) = ttw*(w \ x)e w . (2-2) 

Equation (2.1) shows that 7 T\y(w | x) is a valid conditional density. Since marginalizing over w yields 7r(x), 
sampling from 7r(x, w) and discarding w is equivalent to sampling from 7r(x). When a new parameter value x* 
is proposed via a kernel q(x. dx*), a stochastic estimate 7r(x*; w*) to the posterior density at x* is computed: 
generating the unbiased estimate n(x*-,w*) is equivalent to proposing a new value w* distributed according 
to 7 tw ( dw * | x*) . The stochastic estimate of the posterior density at x is not re-evaluated. The pair (x* ,w*), 
or equivalently x* and its stochastic estimate 7 t(x*\w*), is then jointly accepted or rejected with probability 
given by the usual Metropolis-Hastings ratio; in this case, the acceptance probability simplifies to 

. (, 7? {x*-,w*)q(x*,x)\ 

mm 1. 7 -— . 

V ^(x; w) q(x. x*) J 


Thus, we are able to substitute the estimated posterior for the true posterior and still obtain the desired 
stationary distribution for x. 

A large class of problems for which the PsMMH algorithm is appropriate are set up as follows. Noisy 
observations y = ( 3 / 1 , ...,?/„) of a hidden Markov process are available at various time-points; the noise and 
the dynamics of the hidden process are parametrised by a parameter x. A prior density 7r 0 (x) is provided 
but for a given parameter x it is often impossible to compute the likelihood £ (x; y) of the observations; the 
posterior density 7r(x) is thus typically unavailable. However, given a parameter x, it is often straightforward 
to simulate a realisation of the stochastic process at the next observation time given the value at the 
current observation time. This act, repeated for different realisations for each inter-observation time and 
with appropriate observation-dependent re-sampling after each interval, forms the basis of the particle filter 
[GSS93] and its various extensions [DouOl]. The particle filter provides an unbiased stochastic estimate 
£(x\ y) of the likelihood [DM04] which leads to an unbiased (up to a fixed constant) estimate of the posterior. 


Under mild assumptions [Schl2, Whil2, DM012], the relative variance Var £(x-,y /£(x;y)) grows at most 


linearly with the length of the time-series; this should be contrasted with the typical exponential growth of 
the variance of the naive importance sampling Monte Carlo estimator. Applying these unbiased estimates 
of the posterior within a pseudo-marginal MH algorithm leads to the PMMH algorithm [ADH10]. As the 
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likelihood estimator is often relatively stable in comparison, for instance, to off-the-shelf importance sampling 
estimators, the PMMH algorithm provides a powerful and flexible framework for inference using state-space 
models [FS11, GW11], 

Now suppose that in addition to the computationally expensive unbiased stochastic approximation 7r(x*; W*) 
to the posterior density 7r(x*) we also have a computationally cheap deterministic approximation 7r a (x*) of 
it. The following delayed-acceptance pseudo-marginal Metropolis Hastings (DAPsMMH) algorithm [Smill, 
GHS14] combines the pseudo-marginal Metropolis-Hastings algorithm with the delayed-acceptance Metropolis- 
Hastings algorithm. Given the value of the parameter x and the stochastic estimate 7r(x; w) of the posterior, 
or equivalently given a pair (x, w), an iteration of the algorithm proceeds as follows. 

1. Propose a new value x* from the kernel density g(x, x*). 

2. Stage One: With probability 


oq (x; x*) := min 



7G(x*) g(x*,x) \ 

7r a (x)q(x,x*) ) 


proceed to Stage Two; otherwise set k <— k + 1 and go to 1. 

3. Stage Two: compute a stochastic estimate 7f(x*;w*) of the posterior density at x*, or equivalently 
T> 

generate w* ~ 7 xw*{dw* | x*). With probability 


« 2 |i ( x , w; x*, w*) = min 


/ 7f(x*;w*)7r a (x) \ 

V ’ 7?(x;w)7r a (x*) J 


(2.3) 


accept the proposal (x,w) <— (x*,u>*); otherwise reject it. Set k <— k + 1 and go to 1. 

The Stage Two acceptance probability may be obtained by considering the act of proposing from the kernel q 
and passing the proposal through Stage One with probability aq as a Metropolis-Hasting proposal mechanism 
itself. Clearly, the more accurate the deterministic approximation 7r, the higher the Stage Two acceptance 
probability. The overall acceptance probability is oq 2 (x, w\ x*, w*) = oq (x; x*) a 2 |i ( x , w, x*, w*). 

For a symmetric proposal density kernel g(x, x*) such as is used in the RWM algorithm, the Stage One 
acceptance probability simplifies to 


oq (x; x*) := min 



(Ux*)\ 

(x) J 


(2.4) 


If the true posterior is available then the delayed-acceptance Metropolis-Hastings algorithm is obtained by 
substituting this for the unbiased stochastic approximation in (2.3). 


2.2. Optimising RWM and PMRWM algorithms 

It is well known [RR01, SFR10] that the efficiency of a given RWM algorithm varies enormously with the scale 
of the proposed jumps. Small proposed jumps lead to high acceptance rates but little movement across the 
state-space, whereas large proposed jumps lead to low acceptance rates and again to inefficient exploration of 
the state space. The problem of choosing the optimal scale of the RWM proposal has been tackled for various 
shapes of target [RGG97, RR01, Bed07, BRS09, SR09, Shel3] and has led to the following rule of thumb: 
choose the scale so that the acceptance rate is approximately 0.234. Although nearly all of the theoretical 
results are based upon limiting arguments in high dimension, the rule of thumb appears to be applicable 
even in relatively low dimensions [SFR10]. 

In discussing the literature on optimising pseudo-marginal algorithms it is helpful to define the Standard 
Asymptotic Regime (SAR), where the noise in the stochastic estimator of the log-posterior is additive and 
Gaussian with a variance that is inversely proportional to the CPU time required to produce the estimate. 
The SAR is justified in the case of a particle filter by the asymptotic result in [BDMD13]. It should also be 
noted that all of articles below assume that either the distribution of the noise in the posterior is independent 
of the position in the target space, or that the number of particles is allowed to vary so that this is the case. 

[PdSSGK12] examine the optimal choice of the number of particles m > 1 for a particle MCMC algorithm 
under the SAR. The criterion for optimality is the integrated autocorrelation time (IACT). Under the further 
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(unrealistic) assumption that the Metropolis-Hastings algorithm is an independence sampler which proposes 
exactly from the desired target distribution, it is shown that the PsMMH algorithm is most efficient when the 
variance of the noise in the estimated log-posterior is 0.92 2 . [DPDK15] consider a general Metropolis-Hastings 
algorithm and define a parallel hypothetical Markov chain under the assumption that the true posterior is 
available. The true posterior and the noise are separated out and two probabilities are calculated using the 
Metropolis-Hastings ratio: a EX , the acceptance probability for a hypothetical Metropolis-Hastings algorithm 
under the true posterior and a ®, the acceptance probability for the noise on the log-posterior assuming an 
independence sampler which proposes from the assumed Gaussian density. For the parallel chain a proposed 
move is accepted with probability a Ex oft. The acceptance rate for this parallel chain is always less than 
that for the true chain, and its efficiency provides a tractable lower bound on the efficiency of the true chain. 
Under the SAR, it is shown that the IACT of the bounding chain is minimised when the variance of the noise 
in the estimated log-posterior is between 0.92 2 and 1.68 2 , with the exact value depending on the magnitude 
of the integrated autocorrelation time on a hypothetical Markov chain on the true posterior, with acceptance 
probability a EX . 

[STRR15] examine the behaviour of the pseudo-marginal random walk Metropolis algorithm under various 
regimes for the noise in the estimate of the posterior. Mixing efficiency is considered in terms of both limiting 
expected squared jump distance and the speed of a limiting diffusion (leading to identical formulae), and an 
overall efficiency (ESJD/sec) is defined, which takes into account the total computational time. Under the 
SAR, joint optimisation of this efficiency with respect to the variance of the noise in the log-target and the 
RWM scale parameter, A, is considered. It is shown that the optimal scaling occurs when the acceptance 
rate is approximately 7.0% and the variance of the noise in the estimate of the log-posterior is approximately 
3.3. [STRR15] also note that for the two different noise distributions considered in the article, the optimal 
scaling appears to be insensitive to the noise variance, and even to the distribution. This phenomenon is 
shown to hold across a large class of noise distributions in [Shel5]. 

This article extends [STRR15] to the corresponding delayed-acceptance algorithm. Results on limiting 
acceptance rates and mixing efficiency are proved, as is a diffusion limit. Efficiency under the Standard 
Asymptotic Regime is then considered in detail. We also note consequences for the more straightforward 
delayed-acceptance RWM algorithm. 

3. High dimensional regime 

In this section we introduce the high-dimensional asymptotic regime to be analysed in Sections 4, 5 and 
6. In Section 3.1, the target distributions are described. In Section 3.2 and 3.3 respectively, we introduce 
the deterministic and stochastic approximation to the target distribution and the associated notations. We 
conclude in Section 3.4 with a careful description of the two-stage accept-reject mechanism. 

3.1. Product form target distributions 

We consider in this article target densities that have a simple product form. A research program along these 
lines was initiated by Roberts and coworkers in the pair of papers [RGG97, RR98]; although only simple 
exchangeable product form targets were considered, a range of subsequent theoretical analyses confirmed 
that the results obtained in these articles also hold for more complex target distributions, such as products 
of one-dimensional distributions with different variances and elliptically symmetric distributions [RR01, 
BPS04, SR09, Bed07, SFR10]; infinite-dimensional extension were obtained in [MPS12, PST12, PST14]. 
The d-dimensional target probability distribution tt^ on W 1 that we consider in this article has independent 
coordinates; it is of the form 


d 

n (d) (dx 1 ,... ,dx d ) = (^)n(dXi) (3.1) 

i= 1 

for a one-dimensional marginal distribution 7r = 7%' 1 on R. For simplicity of exposition, we assume that all 
of the probability distributions have a density with respect to the Lebesgue measure and write densities for 
distributions interchangeably; in terms of densities, Equation (3.1) reads = Ilf=i 7r ( a: *)- Throughout 
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this article we assume that the Markov chain (Xl rf ), W) is stationary so that marginally each component of 
Xb^ has distribution tt. We consider Gaussian random walk proposals: for a current position x^ £ W 1 , the 
proposal X^* is assumed to be distributed as 

X (d) * = x (d) + (y) d~ 1/2 Z (d) (3.2) 

for a standard centred Gaussian random variable a tuning parameter p > 0 and a target dependent 
coefficient I > 0 expressed as 


I 2 = E [i'{.X ) 2 ] = -E [t"{X)\ (3.3) 

T> 

for a scalar random variable X ~ 7 r and log-likelihood function £ = log 7 r. The second equality in Equation 
(3.3) follows from an integration by parts that is justified, for example, by the regularity Assumption 4 
described in Section 4. The constant I is introduced to simplify the statements of the results to follow. 
The scaling dr 1 / 2 ensures that, in the high-dimensional regime, d —> oo, the mean acceptance probability 
of a standard Random Walk Metropolis algorithm with proposals (3.2) and target distribution (3.1) stays 
bounded away from zero and one; under mild assumptions, it is optimal [RGG97, Bed07, BRS09, MPS12], 

3.2. Deterministic approximation 

The error in the deterministic approximation to the log-posterior, s(x) := log ( 7 r a (x)/ 7 r(x)), is fixed for any 
given x. We choose to model it as a realisation of a random function. In our setting the target is a d- 
dimensional product of one-dimensional distributions and we imagine that each of the terms in this product 
is approximated through an independent realisation of a random function. This means that the deterministic 
approximation Tri d \x) = n^ d \x) x exp (s^(x)) to the posterior density 7 r( d )(x) possesses a deterministic 
error, on a logarithmic scale, of the form 


d 

s (d) (x) = ^5(a:i,7i), (3.4) 

2—1 

where { 7 i}i>i is the realisation of an i.i.d sequence of auxiliary random variables {Tj}j>i; without loss of 
generality, we can assumes that these auxiliary random variables are uniformly distributed on the interval 
(0,1). In the remainder of this text, we assume that the deterministic function S : R x (0,1) —> R in Equation 
(3.4) satisfies the regularity Assumption 4 stated below. The following two properties of the function S 
directly influence the limiting efficiency of the delayed acceptance algorithm, 

P 1 =^E[d xx S(X,T)] and ft = *E [ d x S(X, T ) 2 ] 1/2 , (3.5) 

T) T) 

where expectation is taken over two independent random variables T ~ Uniform(0,1) and X ~ n. An 
integration by parts and the Cauchy-Schwarz inequality yield that 

|A] < 02- (3.6) 

The quantity — f3\ may be interpreted as a measure of the excess curvature in the deterministic approxima¬ 
tion, whereas /? 2 is a measure of total discrepancy in the gradient. The following simple example illustrates 
this point and provides an intuitive basis for some of our theoretical results in Section 5. 

Example 3.1. Consider a standard centred Gaussian target in with inverse covariance matrix E -1 = 
diag (L,...,L) and an approximate distribution whose i th coordinate is distributed as N (a(Tj), ^Ej)” 1 ) . 
This gives /3i = E [1 — b{T)/L\ and /3| = E [(1 - b(T)/L) 2 + a(T) 2 6 2 (r)/T]. 

It seems natural that a good approximation would match the curvature of the target, and indeed a match¬ 
ing of the curvature of an effectively-unimodal target at its mode is the basis of many importance samplers 
and independence samplers. However in many scenarios, such as the real statistical example considered in 
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Section 7.3, the user has a single approximation and is not at liberty to choose the best from a whole family. 
Section 7.1 details a short simulation study in d = 10 where a Gaussian target is approximated by a logistic 
density and includes investigations of several different choices for the curvature with the mode fixed at the 
truth. Both the DAPsMRWM and the DARWM algorithms are considered. The study shows that whilst the 
best gain in efficiency is obtained when f}\ ~ 0 (and the mode is in the correct location), a substantial gain 
in efficiency can still be obtained even when the curvature of the approximation does not match that of the 
target. In all that follows we therefore consider the general case with j3i 0. 

3.3. Stochastic approximation 

In the remainder of this article, for the study of the stochastic approximation, we adopt the parametrisation 
introduced in Section 2.1; we set W, W* € R implicitly defined through the equations 

7 ?( d )(x*) = 7 r^(x*) e w and 7 ?^(x) = 7 r^(x) e w . 

The hypothesis that if w (x*) is an unbiased estimate to 7 rG) (x*) means that W* , whose distribution implicitly 
depends upon x*, satisfies E [exp (W*) | x*] = 1. For simplicity, and as in the articles [PdSSGK12, DPDK15, 
STRR15], we assume the following. 

Assumptions 1. The additive noise, W*, in the estimated log-target at the proposal, X*, is independent 
of the proposal itself. We write ttw* to denote its distribution. 

This assumption means that, for any value of the proposal X*, the stochastic estimate to the target 
7i-( d )(X*) can expressed as 7r( d )(X*) x e w for a random variable W* ~ ir\y* independent of any other source 
of randomness; the distribution of the auxiliary random variable W* is independent of X*. The noise term 
within the Markov chain, W, does not have the same distribution as the noise in the proposal, W*, since, 
for example, moves away from positive values of W are more likely to be rejected than moves away from 
negative values. Standard arguments show that in our d-dimensional setting, the process (X^, is a 

Markov chain with invariant distribution ® nw where 

= exp(iw). (3.7) 

CL7T\y* 

This is Lemma 1 of [PdSSGK12]. In Section 5, we examine the behaviour of the algorithm under the following 
Gaussian assumption. 

Assumptions 2. The additive noise in the estimated log-target at the proposal, IT*, is Gaussian and is 
independent of the proposal itself, X*, 

WT £ N(-a 2 /2, CT 2 ) . (3.8) 

Note that in Equation (3.8) the mean is determined by the variance so as to give an unbiased estimate of 
the posterior, E [exp (IP*)] = 1. It follows from (3.7) that at stationarity, under Assumption 2, we have 

IP £ N (ct 2 /2, a 2 ) . (3.9) 

This article focuses on algorithms where the stochastic approximation to the likelihood is computationally 
expensive; in most scenarios of interest [GW11, KdV12, GHS14, FG14] the stochastic approximation is 
obtained through Monte-Carlo methods (e.g. importance sampling, particle filter) that converge at the 
standard A -1 / 2 rate where N designates the number of samples/particles used. For taking into account the 
computational costs necessary to produce a stochastic estimates of the target-density, we thus assume the 
following in the rest of this article. 

Assumptions 3. When Assumption 2 holds, the computational cost of obtaining an estimate of the log- 
target density with variance a 2 is inversely proportional to a 2 . 

The recent article [BDMD13] shows, among other things, that for state space models the unbiased estimate 
of the likelihood obtained from standard particle methods [DM04] satisfies a log-normal central limit theorem, 
as the number of observations and particles goes to infinity, if the number of particles is of the same order 
as the number of noisy observations. This justifies the Gaussian approximation (3.8) and shows that the 
log-error is asymptotically inversely proportional to the number of particles used, justifying Assumptions 3. 
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3 . 4 . Acceptance probabilities 


In this section we give formulae for the different acceptance probabilities when the DAPsMRWM algorithm 
is used for product-form targets as described in Section 3.1. When the current position of the algorithm is 
(x( d \w( d )) € R d x R, a proposal distributed as 

X (d )’* = x (d) + (!) d ~ 1/2 z(d) and wid) ’* ~ nw (3.10) 

is generated. In this section and subsequently we will need to refer to three separate quantities and to 
distinguish for each which parts are fixed and which are random. We therefore define 

q^ = log s (d) = s^(x (d) ’*) - s M (x (d )), w= w (d) '* - w<- d \ 

Q (d) = log [7r( d )(XW’*)/7r( d )(x( d ))], = s( d )(X (d) ’*) - s (d) (x (d) ), \N (d) = W^’* - wW, (3.11) 

Q (d) = log [7r (d) (X( d )’*)/7r( d )(X (d ))], S i d) = s^lfX^)’*) - s^X^), W^ = W^’* - . 


The deterministic approximation to the posterior density is used for a first screening procedure and 
the stochastic approximation is used for the second part of the accept-reject mechanism; with the notations 
introduced in the previous sections, the Stage One and overall acceptance probabilities read 


a W ( x (<0, w (<0 ;x («0,*, W M,*) = F f q W + s (,d)\ 


(3.12) 


for the Metropolis-Hastings accept-reject function F(u) = min (1, exp(rt)). Note that most of the proofs 
readily adapt, under mild regularity assumptions, to the case where F : K. —x (0,1] is a continuous and 
increasing function that satisfies the reversibility condition e~ u F[u) = F{—u ) for all u £ R. When the 
current position of the algorithm is (x^ d \w^), the first and second stage acceptance rate are defined by 


£ 

A 

II 

m 

a[ d) 

£ 

II 

m 

s 

s 

§ 

s 


for a proposal (X^ d, *\ W^’*) distributed as in (3.10). The conditional second stage acceptance rate is defined 
through Bayes rule as , w^) = (x^to^) /a^ (x.^, w^). We will eventually be interested 

in the acceptance rate at Stage One and the overall acceptance rate, which are defined as 


a 


(d) 


= E 


Ad) 


(x (d) , W {d) 


and 


(d) 

*12 


= E 


a 


(x (d) , W (d) ) 


with (XS d \ W^) ~ ® 7 Tw, as well as in the conditional Stage Two acceptance rate = at^/ctf^. 


4. Asymptotic analysis 

In this section we investigate the behaviour of the DAPsMRWM algorithm in the high-dimensional regime 
introduced in Section 3. We make the following regularity assumptions. 

Assumptions 4. The density 7 r : K. —x (0, 00 ) and the function 5 : R x (0,1) -> R satisfy the following. 

1. The log-target function t = logx is thrice differentiable, with second and third derivative bounded. 
The quantity E [t"(A) 2 ] is finite, for X ~ 7 r. 

2. The first three derivatives of the function ( x , 7 ) 1 —x S(x , 7 ) with respect to the first argument exist and 
are bounded over Rx (0,1). 

Assumptions 4 are repeatedly used for controlling the behaviour of second-order Taylor expansions; they 
could be relaxed in several directions at the costs of increasing technicality in the proofs. The following 
lemma is pivotal, and is proved in Section 6.1. 
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Lemma 4.1. Let the regularity Assumptions 4 hold. Let {'y*}*>i be a realisation of the sequence of auxiliary 
random variable used to described the deterministic approximation (3.4) to the posterior density. Let {JQ}j>i 
be an i.i.d sequence marginally distributed as it and {xi},;>i be a realisation of it. For d > 1, set x^ = 

(xi, ..., Xd) £ and define the random variable X= x ^ + (ft/I) d ~ 1 / 2 Z^ for Z ^ S N (0, Id)- For 
almost all realisations {xt}i>\ and and meR, the following convergence in distribution holds, 

lim (q^S^) = (Q%,S%), 

where and S/f) are two Gaussian random variables marginally distributed as 

Qa and S%Z N^faft 2 ,^ (4.1) 

and with correlation p = —fa/fa £ (0,1) for parameters fa and fa defined in Equations (3.5). 

The fact that p £ [—1,1] is another manifestation of inequality (3.6). In the Gaussian Example 3.1 with 
b(T) = b > L and a(T) = 0 (and thus fa < 0) it is readily seen that q^ and s^ have the same sign and are 
positively correlated. In general, Lemma 4.1 shows that if the approximating density has an average excess 
of (negative) curvature (i.e. fa < 0), the limiting random variables and are positively correlated. 

4-1- Numerical confirmation of Lemma 4■ 1 

Lemma 4.1 is pivotal to the high-dimensional asymptotic analysis to be described in subsequent sections. The 
product form Assumptions (3.1) and (3.4) from which we derive the bivariate Gaussian distribution in Lemma 
4.1 are chosen for convenience. We expect the same conclusions to hold, at least approximately, in much 
broader settings; for example, we believe that extensions of Lemma 4.1 to non i.i.d target distributions similar 
to those discussed in [BPS04, Bed07, BR08, SR09, BRS09, PST12] are possible, at the cost of much less 
transparent proofs. In order to test the (approximate) validity of Lemma 4.1 in more realistic scenarios, and 
thus test the robustness of the results proved in this article, we consider in this Section a toy Bayesian inverse 
problem [StulO] where none of the i.i.d assumptions are satisfied. We consider the problem of reconstructing 
an initial one-dimensional temperature field represented by a continuous function T(-,t = 0) : [0,1] —> K. 
from N observations at time t = r corrupted by independent Gaussian additive noise with known variance 
CT noise > 0- I n other words, we collect {yi}^ =1 with j/j ~ N ( T(xi,t = T) , cr( oise ) for 1 < i < N at some 
location Xi £ (0,1). We assume that the evolution of the temperature field is described by the heat equation 
d t T = (1/2 )d xx T with Dirichlet boundary T(x = 0 ,t) = T(x = l,t) = 0 for all time t £ [0,r]. We adopt a 
Gaussian process prior on the initial and unobserved temperature field and represented this prior as a finite 
Karhunen-Loeve expansion 


K 

T(x,t = 0) = sin(/c 7 rcc), 

fc=l 

for independent Gaussian random variables ~ N (0, Kk)\ the decay of the sequence Kk > 0 controls the 
a-priori smoothness of the initial temperature field. We chose up = 1 /k and K = 40 in our simulations. 
We have chosen this simple Bayesian inversion problem since a closed form solution for the heat equation 
is available; this allows a straightforward analysis of the approximation. Our approximate target is ob¬ 
tained through a coarse discretisation of the heat equation on N x = 50 and N T = 10 equidistant spatial 
and temporal points and using a standard fully-implicit finite-difference scheme [e.g. Thol3]. We imple¬ 
mented an exact RWM algorithm in the Fourier domain; i.e. the initial temperature field T(-,t = 0) is 
represented by its Af-dimensional Karhunen-Loeve expansion c = (ci,..., ck)- The prior log-density equals, 
up to an irrelevant additive constant, —( 1 / 2 ) X]fe=i C k/ K k an d the log-likelihood reads, up to a constant, 
— (l/2c r n 0 ise) Y)iLiiVi -F(c)(xi)} 2 where E(c)(x) = J2k=i c k exp (-(& 7r) 2 t/2) sin(knx). The variance of 
the RWM proposal was proportional to the prior variance matrix (which is not optimal, but reasonable in 
our example) with a scaling A > 0 chosen so that roughly 25% of the proposals were accepted. 
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We focus on the aspects of Lemma 4.1 that are new: the properties of 5 a and its relationship with Qa- 
The marginal properties of Q a have been known for some time [RGG97]. We ran 10 5 iterations of the 
exact RWM Markov chain in the Fourier domain and investigated numerically the distribution of the pair 
(Q A , 5a) at the final position of the RWM chain (in order to be in the main mass of the target distribution); 
the Gaussian behaviour of (Q a, 5a) is confirmed, as well as its non-trivial correlation structure (Fig. 1). 
Furthermore, we repeated the same experiment (results not presented here) at several other locations in the 
bulk of the target distribution and the distribution of (Qa,5a) appears approximately independent of the 
location, as predicted by the theory. 



Fig 1. Empirical distribution of (Q Ai-S'a) evaluated from a current point in the bulk of the target distribution. The dashed 
lines in the left and right panels show the densities of Gaussian fits to the empirical marginal distributions of Qa and Sa 
respectively. 


To investigate the validity of Equation (4.1), we computed the quantities E[5a]/A 2 and Var[5A]/A 2 
and Corr [Qa,5a] for several choices of jump scaling A > 0; Lemma 4.1 predicts that these quantities are 
independent of the scaling A > 0, as is approximately numerically confirmed in Figure 2. 


Scaled expectation and variance of S A , and CorlQ^SJ 
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Fig 2. To investigate the validity of Equation (4.1), we computed the quantities — E[5a]/A 2 (o) and Var [Sa] /^ 2 (A) and 
Corr[QA-,SA] (~\~) f or several choices of jump scaling A > 0; Lemma f.l predicts that these quantities are independent of the 
jump size A > 0; this is approximately true in this Bayesian inverse problem toy example. 
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4.2. Limiting acceptance probability 


The following lemma identifies the limiting acceptance rates as the dimension d goes to infinity. 

Proposition 4.1. Let Assumptions 1 and 4 hold. For almost every realisation { 7 i}i>i of the sequence of 
auxiliary random variables used to described the deterministic approximation (3.4) to the posterior density 
we have 


lim E 

d—> 00 


i[ d) - ai 


= 0 


where the limiting acceptance rates are given by 
ai =E[F(Q% + S%)] and 


and 


lim E 

d—> 00 


a ( ${X.W,wW)-a 12 


= 0 


o 12 = E [F (Qa +S%)xF (W a - S%)] 


(4.2) 


(4.3) 


for (Qai^a) as described in (4.1) and Wa = W* — W for [W* ,W) ~ 7 T\y* 0 nyy. The dependence of a\ 
and a \2 upon (p, /3x, implicit. 

Corollary 4.1. Under Assumptions 1 and 4 we have lim^oo = ax and lim^oo a ^ = 072 - 

Proof of Proposition 4-1 . We prove the first limit in Equation (4.2); the proof of the second limit is analo¬ 
gous. The first limit is equivalent to 


lim E 

d—> 00 



x (d) 


- E [F {Qa + S% 


2 


= 0 . 


This follows from the dominated convergence theorem, since the function F is bounded and continuous, and 
from the almost sure convergence in distribution proved in Lemma 4.1. □ 


For the remainder of our discussion of acceptance rates we adopt the Metropolis-Hastings acceptance 
probability and we suppose that Assumption 2 holds: there is additive Gaussian noise in the logarithm 
of the stochastic approximation. We also make the dependence of the acceptance rate on the approxima¬ 
tion parameters, /? 1 and /3 2 , explicit. Standard computations (e.g. Proposition 2.4 of [RGG97]) yield that 
G(p, a 2 ) := E [F(N (/x, cr 2 ))] = 4 ’{p/a) + exp (/x + a 2 / 2) 4 >(—a — p/a), with 4> : R —» (0,1) the standard 
Gaussian cumulative distribution function, so that the Stage One acceptance rate is 


otx{p, cr;/3i, P2) — ^-(1 — /3i), /x- (l + /?2 — 


= $< - 


A + exp (i(A 2 - IhW ] x *< 


1 + 2 ^| -3/3! 

2-y/l + /3f — 2/?! 


(4.4) 


The limit as /3i —>■ 0 and /3 2 —> 0 corresponds to the case when there is no deterministic error and leads 
to the usual [RGG97, MPS12] limiting acceptance rate of 2 x 4>(— p/2). For computing the limiting overall 

acceptance rate, note that under the Gaussian Assumption 2 we have Wa ~ N (—cr 2 ,2cr 2 ); conditioning 
upon in Equation (4.3) yields that the limiting overall acceptance rate 012 (/x, cr; /3i, /3 2 ) reads 


E 


G -x(l - Pi)g 2 + P (/3 2 - P1/P2) Z, p 2 { 1 - P1/P2) xG -cr 2 - -p 2 px -pfas, 2a 


(4.5) 


with £ ~ N (0,1) a standard Gaussian random variable; we have used the fact that the random variable 
has the same distribution as /x 2 /3i/2 + pf 3 2 £ and that is a Gaussian vector with correlation 

p = ~Px/p 2 - The following result, whose proof is deferred to Appendix A.l, shows that it is possible to 
characterise the (unknown) values of p and a 2 in terms of the Stage One and the conditional Stage Two 
acceptance rates. 

Proposition 4.2. Let Assumptions 1, 2 and 4 hold and let the Metropolis-Hastings accept-reject function 
be used. 
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1. For any fa > 0 and fa < 1 the Stage One acceptance rate oq(^t; fa, fa) is a continuous decreasing 
hijection in p from [0, oo) to (0,1]. 

2. For any fixed pi, fa > 0 and fa, the conditional Stage Two acceptance rate a^fapL, a; fa, fa) is a de¬ 
creasing bijection in a from [0, oo) to (f), a 2 \i{pfa-, fa, fa)\. 

The case where fa > 1 is of no interest since in Section 5 we show that when fa > 1 the overall efficiency 
of the algorithm cannot exceed that of the standard PsMRWM. 

4-3. Limiting expected squared jumping distance 

A standard measure of efficiency [SR09, BRS09, Shel3] for local algorithms is the Euclidian Expected Squared 
Jumping Distance (ESJD); see [RR14b, PG10] for detailed discussions. Theoretical motivations for our use 
of the ESJD are given by the diffusion approximation proved in Section 4.4. In our d-dimensional setting, it 
is defined as 


ESJD (d) = E 


v(d) _ v(d) 

^k+1 ^k 


where the Markov chain | (xf, w /j is assumed to evolve at stationarity and || • || is the standard 
Euclidian norm. 

Proposition 4.3. Let Assumptions 1 and 4 hold. For almost every realisation {D'i}i>l °f the sequence of 
auxiliary random variables used to described the deterministic approximation (3.4) to the posterior density 
we have 


lim ESJD {d) = a 12 x (y/ = J(g) (4.6) 

d—> oo Vi/ 

where ol\ 2 is the limiting acceptance rate identified in Proposition 4-1. The dependence of the limiting expected 
squared jumping distance J(g) upon (f3,ww) is implicit. 

4-4• Diffusion limit 

We are motivated to prove that the DAPsMRWM algorithm in high dimensions can be well-approximated 
by an appropriate diffusion limit as this provides theoretical underpinning to our use of the ESJD as measure 
of efficiency [BDM12, RR14b]. The connection between ESJD and diffusions comes from the fact that the 
asymptotic jumping distance lim^oo ESJD* 1 ^ = J(g) is equal to the square of the limiting process’s diffusion 
coefficient and is proportional to the drift coefficient. By a simple time change argument, the asymptotic 
variance of any Monte Carlo estimate of interest is inversely proportional to J(g). Consequently, J(p) 
becomes, at least in the limit, unambiguously the right quantity to optimise. 

It is important to stress that the existence of the diffusion limit in this argument cannot be circumvented. 
MCMC algorithms which have non-diffusion limits can behave in very different ways and ESJD may not 
be a natural way to compare algorithms. The main result of this section is a diffusion limit for a rescaled 
version of the first coordinate process. For time t> 0 we define the piecewise constant continuous time 
process 


V {d) (t) := X (d) 


\_dxtj ,1 


with the notation x[, d) = (xfj,..., x/ d ) £ R d . In general, the process V^ is not Markovian; the next 
theorem shows nevertheless that in the limit d —> oo the process V^ d i can be approximated by a Langevin 
diffusion. 

Theorem 4.1. Let Assumptions 1 and 4 hold. Let T > 0 be a finite time horizon and suppose that for all 
d > 1 the DAPsMRWM Markov chain starts at stationarity, (Xj^, W^) ~ (g) ttw- Then, as d —)• oo, 
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the sequence of processes converges weakly to V in the Skorokhod topology on D([0,T],M) where the 
diffusion process V satisfies the Langevin stochastic differential equation 

dVt = J x ' 2 {p) dB t + i J{p) i\V t ) dt (4.7) 

..... z> 

with initial distribution V$ ~ n. The process B t is a standard scalar Brownian motion. 

Note that, as with Propositions 4.1 and 4.3, the Gaussian Assumption 2 is not necessary for the conclusion 
of Theorem 4.1 to hold. The proof can be found in Section 6.3. Theorem 4.1 shows that the rescaled 
first coordinate process converges to a Langevin diffusion V that is a time-change of the diffusion dVt = 
dB t + \ dt; indeed, t >->• V t has the same law as t Vj^t- This reveals that when speed of mixing is 

measured in terms of the number of iterations of the algorithm, the higher J{p), the faster the mixing of the 
Markov chain. See [RR14a] for a detailed discussion and rigorous results. However any measure of overall 
efficiency should also take into account the computational time required for each iteration of the algorithm, 
and this is the subject of the next Section. 

5. Optimising the efficiency 

When examining the efficiency of a standard RWM the computational time is usually either not taken into 
account or is implicitly supposed to be independent of the choice of tuning parameter(s). In our situations, 
the computational time necessary to produce the stochastic approximation to the target distribution has to 
be taken into account. For this article, we measure the efficiency through a rescaled version of the expected 
squared jump distance, 


. . (Expected Squared Jump Distance) 

(Efficiency) = —--- ; - ; -r. (5.1) 

(Averaged one step computing time) 

For any increasing function & the quantity ^(ESJD)/(Averaged one step computing time) is a valid mea¬ 
sure of efficiency; the discussion at the start of Section 4.4 reveals nonetheless, because of the diffusion 
approximation proved in Theorem 4.1, that (5.1) is the essentially unique measure of efficiency valid in the 
high dimensional asymptotic regime considered in this article. 

When the DAPsMRWM is employed, the average computational time for a single iteration of the algorithm 
is the sum of the average time needed to compute the deterministic approximation 7 ri^ and af 1 times the 
average times it takes to compute the stochastic approximation. Under Assumption 3, the average time 
needed to compute the stochastic approximation is inversely proportional to the variance, <r 2 , of the estimate 
of the log-target; since time units are irrelevant to our discussion, we can assume that this average is exactly 
1 /ct 2 . On the same time scale, we call 77 the average computational time needed to produce the deterministic 
approximation. In other words, the average computational time for a single iteration of the algorithm is 

(Averaged one-step computing time) = 77 + oq /< 7 2 . 


Proposition 4.3 shows that the limiting ESJD equals a 12 x {p/I) 2 where /, defined in Equation (3.3), is a 
constant irrelevant for the optimisation of the efficiency discussed in this section. Following Equation (5.1) 
and eliminating unnecessary constants, the limiting efficiency of the DAPsMRWM can be quantified by the 
following efficiency functional, 


Eff(/q cr 2 ) 


p 2 <7 2 a 12 {p, a) 

pa 2 +ai(p) 


(5.2) 


The dependence upon /3i, /?2 and nw is implicit. Theorem 5.1, which is proved in Appendix A.3, shows that 
the efficiency functional Eff(/u, cr 2 ) possesses intuitive limiting properties: too large or too small a jump size 
and/or stochastic variability in the estimation of the target is sub-optimal. A further property is proved. 


Theorem 5.1. Let the regularity Assumption f, the cost Assumption 3 and the Gaussian Assumption 2 
hold. Suppose further that the Metropolis-Hastings accept-reject function has been used. 

1. For a fixed variance a 2 > 0 we have Eff (p, a 2 ) —» 0 as p 0 or p 00 . 
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t|=10” 3 , p,=-0.5, p 2 =0.6 n=10' 3 , Pi =0.0, 02=0.6 t)=1 0 -3 , p,=0.5, |3 2 =0.6 





Fig 3. Contour plots of the asymptotic efficiency relative to the optimal efficiency of the equivalent pseudo-marginal RWM 
algorithm, Eff re i, as a function of the scaling, p, and the variance of the noise in the log-target, a 2 , for different choices of 
Ph P 2 at rj = 10 — 3 . For comparability, all contours are at 0.5,1,2,3,4,5,6,7,8,10,12,15. The horizontal dashed line denotes 


a 2 = 1. 


2. For a fixed jump size pi > 0 we have E ff(pi, a 2 ) —» 0 as a 2 —> 0 or a 2 —> oo. 

3. For a fixed jump size pi > 0, on the interval a 2 £ (0,6) the efficiency functional a 2 H > FS(pi, a 2 ) has 
no local minima and at most one local maximum. 

Using the same time scale as in (5.2), the equivalent efficiency function for the pseudo-marginal RWM 
is ESpM{p-,cr 2 ) := 2 pi 2 cr 2 & \\J pi 2 + 2cr 2 J, and this is maximised at pipM ~ 2.562 and cr 2 PM ~ 3.283 
[STRR15]. We may therefore define the relative efficiency of the DAPsMRWM algorithm compared with the 
maximum achievable efficiency of the PsMRWM as follows. 


Eff re i(/qcr 2 ) 


Eff (/x, a 2 ) 

Eff pm(/xpm, dp M ) 


(5.3) 


In Appendix A.2 we prove the following, which shows that one would never wish to take fa > 1. 

Proposition 5.1. If fa > 1 then Eff rei (/x, a 2 ) < 1/fa for all pi > 0 and a > 0, provided f^a 2 \i(pi, er; fa, fa) < 
0 for all fixed pi , a and fa > 1. 

We have examined plots of a 2 |i against fa for a large number of combinations of pi, a, fa and have always 
found it to be a decreasing function, but we have been unable to prove that this is always the case. 

In Example 3.1 fa > 1 corresponds to an unrealistic, improper ‘Gaussian’ approximation with positive 
curvature; however, more complex targets and approximations can provide realistic settings for /3i > 1 . 
Consider, for example, a bimodal target and a bimodal approximation but where the local minimum of the 
approximation corresponds to a local maximum of the target. 

For illustration, and somewhat arbitrarily, in this section we deem an increase in the efficiency function by 
a factor of at least 3 as being sufficient to warrant the extra effort in implementing the delayed acceptance 
algorithm. Figure 3 shows contour plots of Eff re i as a function of pi and a 2 for specific combinations of 
fa > 0, \fa\ < fa and r] > 0 chosen so that sup MjCT 2 Eff rel (^i, er 2 ) > 3. Each plot shows a single mode; each 
also shows that for a particular variance, the optimal scaling, pi(a) is insensitive to the value of a , except 
when, approximately, a < 1 , at which point the optimal scaling increases. 

We have been unable to prove that the above points hold for all possible combinations of parameters; 
however, these same points were born out in many other plots that were produced across the range of 
allowable values for fa, fa and p. In [STRR15] and [Shel5]. a similar insensitivity of the optimal scaling, 
/x(tr), albeit valid across the whole range of values for a 2 , is found for pseudo-marginal RWM algorithms 
across a range of distributions for the noise in the stochastic approximation. 

Calculations using the formula for an, the Stage One acceptance rate (4.4), reveal that with pi = 3, a 
10 % increase in scaling leads to a 26% decrease in an when the approximate target is exact {fa = fa = 0 ), 
similar sensitivity is found for other values of \fa\ < fa and this sensitivity increases with the scaling so that 
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DAPM (n-10 1 ) DAPM(n-IO^) DAPM (^-10 3 ; 





Fig 4. Contour plots of the maximum (over fi and a 2 ) asymptotic efficiency of the DAPsMRWM relative to the op¬ 
timal efficiency of the equivalent pseudo-marginal RWM algorithm, Eff re ;, as a function of /\ + /1 2 and fti — @ 2 , for 
V S {1(T\ 10“ 2 ,1CT 3 }. 


when ft = 4 the same relative increase in ft produces a 39% decrease in acceptance rate. Thus, even though 
the optimal scaling may be insensitive to the value of a 2 , sensitivity should be expected in the equivalent 
optimal Stage One acceptance rate. This impacts on our recommended approach to tuning in Section 7.2. 

Figure 4 plots the maximum (over [i and a 2 ) achievable relative efficiency as a function of /% and /3| for 
different values of 77 . Several interesting features are evident. 

1 . As might be expected, an approximation that exactly matches the target produces the most efficient 
algorithm. 

2. For a given fixed /3i, decreasing /? 2 increases the efficiency, and it is optimal to reduce /?2 until /? 2 = /3i |- 
In Example 3.1 this is equivalent to ensuring that the mean of the Gaussian approximation matches 
the mean of the Gaussian target, whilst leaving the variance of the approximation fixed. 

3. For a given fixed P 2 , decreasing |/?i| decreases the efficiency, except for some small values of |/3i|, close 
to the dotted line. In Example 3.1 to keep the mean squared discrepancy in the gradient fixed whilst 
altering the curvature to match the target more closely one must increase the discrepancy from the 
target in the mean of the Gaussian approximation. Roughly speaking, Figure 4 bears out the intuition 
that it is more important that the approximation matches the position of the target than that it 
matches the curvature. Indeed, as born out in the simulation study in Section 7.1, for sufficiently small 
77 values, worthwhile efficiency gains can still be achieved even when /% is far from zero 

4. For fixed /? 2 , with c € [0, /? 2 ], except with very accurrate but computationally expensive targets, the 
efficiency is typically higher when /% = —c compared to when = c: it is better for the approximation 
to have a higher (negative) curvature than the target. When /3\ is negative, the positive correlation 
between S'a and Qa leads to a lower Stage One acceptance rate, the effect of which on the mixing is 
effectively cancelled out by the proportionally lower computational time; however the correlation also 
produces a higher Stage Two acceptance probability, which leads to an overall increase in efficiency. 
This is born out by the simulation study in Section 7.1. 

5. When the calculation of the fast approximation is only 10 times quicker than producing a realisation 
of an unbiased estimator with a variance of 1 (77 = 0 . 1 ) it is impossible to achieve a minimum desirable 
theoretical efficiency gain of a factor of 3. Indeed, calculations reveal that a perfect approximation 
(/?i = P 2 = 0) just achieves the required gain when 77 ss 0.033. Thus, there is little point in using an 
approximation unless it is at least 30 times quicker to evaluate than the pseudo-marginal estimate is 
when a 2 = 1 . 

Plots of the scaling that achieves the optimal efficiency, /t, against /% + ft 2 and /3i — @2 (not shown) possess 
contours with a similar shape to the contours in Figure 4, with /t increasing as P 2 decreases; also, as with 
efficiency itself, for fixed /?i and P 2 , R increases with decreasing 77 , though not as markedly as efficiency. For 
a fast approximation, a large scaling leads to many rejections at Stage 1 but since these rejections are cheap, 
the computational time penalty is small; if the approximation is accurate then once a proposal is accepted 
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Fig 5. Contour plots of the maximum (over fi) asymptotic efficiency of the DARWM relative to the optimal efficiency of the 
equivalent RWM algorithm, Eff re i, as a function of fi i + ft 2 and, f)\ — @2, for 77 £ {10 1 , 10 -2 ,10 3 }. 


at Stage One it is likely to also be accepted at Stage Two, leading to a bigger move with little increase in 
computational time. 

Plots of the optimal variance, a 2 , showed a gradual decrease in a 2 with decreasing fa, indicating that to 
realise the potential of an accurate approximation the Stage Two acceptance rate should not be dominated by 
noise in the unbiased estimate of the target. However, as a consequence of Lemma 4.1 the relative innaccuracy 
of the deterministic approximation to the Metropolis-Hastings ratio increases with the scaling, ft. Very small 
g values are associated with large fi values and, since in the Stage Two acceptance ratio there is little to be 
gained from making the unbiased estimate much more accurate than the deterministic approximation, the 
optimal a 2 values do not decrease as substantially as might otherwise be expected. 

To place actual numbers on the above points: for g £ {10 —2 ,10 -3 ,1CV 6 }, empirically, we found that in 
regions where an increase in relative efficiency of at least a factor of 3 is achievable, 7 > fi > 4 > 2.56 ss fipM 
and 1.4 < a 2 < 3 < 3.28 ~ <7p M . 


5.1. Tuning the Delayed Acceptance RWM algorithm, 


In the case of the DARWM there is no stochastic estimate of the posterior since it is known precisely. The 
bounded convergence theorem applied to (4.5) shows that in the limit as a 2 -A 0, 


a i 2 (/b 0) = E 


G —-(1 — fa)p 2 + p (fa - fa/fa) £, p 2 (l - fai/fil) 




(5.4) 


With this expression for the overall acceptance rate, the efficiency function (5.2) becomes 


ES(p) 


ft 2 0 : 12 ( 77 ,0) 
V +oti(p) ’ 


(5.5) 


where g is now the ratio of the computational times required to calculate the cheap approximation to the 
posterior and the expensive exact posterior. 

Using the same timescale, the efficiency of the RWM is ESpwM(ft) := 2y 2 <&(—y/2) [RGG97], which is 
optimised at p = fiptwM ~ 2.38. We may therefore define the relative efficiency of the DARWM algorithm 
compared with the optimal efficiency of the RWM algorithm 


Eff re i(/z) := 


Efffc) 

E&kwm (rrwm) 


(5.6) 


Plots of Eff re i against p, for different values of fa and fa (not shown) reveal a single mode, indicating 
that, as with the DAPsMRWM, there is an optimal choice of scaling parameter. Figure 5 plots the maximum 
(over p) achievable relative efficiency as a function of fa and fa for different values of g. The features of 
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these plots are broadly similar to those of the plots for the relative efficiency of the DAPsMRWM, however, 
an approximation that is 10 times quicker to compute than the true posterior is just sufficient for the overall 
efficiency to increase by a factor of 3 provided the approximation is very accurate. 


6. Proofs 

It will be helpful to introduce i.i.d sequences {Xj}j>i and {Pi}i>i respectively marginally distributed as 7r 
and 7Tr, and corresponding realisations of them, {xi}i >i and { 7 ;}i>i ■ Similarly, we consider an i.i.d sequence 
{Zi,k}i,k> o °f standard Gaussian N (0,1) random variables, {Uk}k>o an i-i.d sequence of random variables 
uniformly distributed on (0,1), W a random variable distributed as Ttw and {W/}k>o an i.i.d sequence 
distributed as Ttw*- For any dimension d > 1 we set = (X\,...,Xd) £ R d and = W and 

xjfj’* = xj^j + (ft/I) dr 1 !' 1 Zkj; we recursively define 



for a proposal X^’* = (X ^’*,..., X12’*)- Indeed, the process (X.\ d \wf, d ' > ) is a DAPsMRWM Markov chain 
started at stationarity and targeting tt^ (g) Ttw- We denote by Tk the cr-algebra generated by the family of 
random variables {x| d \ \ t < fc} and use the notation E*,[ • ] for designating the conditional expectation 

E[- | Fk\- Similarly, we use the notation E XiW [-| instead of E[- | (Xq^Wo^) = (x, u>)]. 


6.1. Proof of Lemma f.l 


The Law of Large Numbers and the separability of L 1 (Tt®Ttr) readily yield that for almost every realisations 
{xi}i>i and {7;}i>i, the following holds, 

lim n 

n—>oo 

We can thus safely assume in the remainder of this section that Equation (6.1) holds for the realisation 
{7i}i>i °f the auxiliary random variables used to describe the deterministic approximation (3.4) . By the 
Cramer-Wold device, for proving Lemma 4.1 it suffices to establish that for any coefficient cq,cs £ R. the 
sequence cq (x.^) + cs ^^(x^) converges in law towards cq Q^+cs S the boundedness assumption 
on the derivatives of the functions x K > l(x) and x <—> S(x, u) and a second order Taylor expansion show that 
this is equivalent to proving that the sum 

[ CQ I'( x i) +c s d x S(xi,fi)} Zi + - ^2 { C Q I"( x i) +c s dxxS(xi,fi)} 

v i —1 i =1 

converges in law towards cq Q ^ + cs . Definition (3.5) of the coefficient /3i and /I 2 yields that for almost 
every realisation {xi}i >i and { 7 i}i>i we have 


^2 <p( x i, 7 i) = J f>(x, 7 ) (7T (g) 7Tr) (dx, df) for all £ i 1 (7r ® 7rr). 


( 6 . 1 ) 


d 

i—1 


(^'(Xi) 2 , I"(x-i), d x S(x i ,fi) 2 1 d xx s(x i ,f i ), l'(x i )d x s(x. i ,f i ) s j ->• (l,-1 , 0$, Pi,-fii), (6.2) 


from which the conclusion directly follows since cq Q ^ + cs S “ has a Gaussian distribution with mean 
ft 2 (csP 1 — cq)/ 2 and variance ft 2 ( Cq + c| / 3 f — 2 cq cs /?i). 
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6.2. Proof of Proposition f.3 

The quantity ESJD^ can also be expressed as 


ESJD (d) = 


Pd 


E E (^j d) ) 2 x F (Qa° + Sa ] ) x f ( w a ] - s '< 


i'=i 


= E e 


(. z[ d) Y x F ( Q^ x F (w^ ] - S i d) ) 


for Q (d) , S (d \ W-'E defined in (3.11); the second equality follows from the exchangeability, at stationarity, 
of the d coordinates of the Markov chain. One can decompose and S E as a sum of a term that is 
independent of z [ d ^ and a negligible term; we have Q E = Q^’ ± + log [ 7 r(X^’*)/ 7 r(X^)] and = 
+«S(X^’*, 7l ) -5 (X^,ti) with 


Qa 1± = E lo S [ 7 r(xf’*)/ 7 r(xf)] and S% ± = E ^(xf 7j ) - S(xf , 7i ). 


1=2 


1=2 


Note that Q E’"* - and S < £' >,± are independent of z[ d K Under Assumption 4, the moments of order two of the 
differences qE — qE'^ and S^ 1 — S^’ ± are finite and converges to zero as d — 1 oo. The Cauchy-Schwarz 
inequality and the fact that F is bounded and Lipschitz and bounded yield that ESJD ^/(p/I) 2 can also 
be expressed as 


E 


Zi 


(d) 


x F 


(qE ,j 


+ EE) x F ( 


w 


(d) 


_ q(d) 


■ x ) 


■E 


■E 


(E) x F (q < £ >,± + EE) x {e (w£ ] - E d) ) - F - EE) } 


(. Z[ d) y~ x {e ( Q (d) + S«>) - E (<EE + EE) } x E ( 


'w/E ~E?' 


= E 


F (qE ,X + EE) x ^ ( 


wE - EE 


’ ± )] +o(l) 


= E [E (Q£ +®xE (W A - S£)] + o(l) = a 12 + o(l), 

as required. We have used the fact that for almost every realisation of the auxiliary random variable {r ? - } 7 >i 
the sequence ^QE’E EE"*") converges in distribution to {Q^,S^), which readily follows from Lemma 4.1. 


6.3. Proof of Theorem f.l 


The proof is a generalisation of the generator approach of [RGG97, Bed07] coupled with an homogenization 
argument. We introduce the subsampled processes X/ d ) and W^ defined by 


-yW ■y" G) 

k - A kxTW 


and 


M/( d ) _ My( d ) 
vv k ~ VV kxTW 


for an intermediary time scale defined as T ^ = |_d 7 J where 7 is an arbitrary exponent such that 7 £ ( 0 ,1/4). 
One step of the process X^ (resp. W^ d> ) corresponds to steps of the process (resp. W^). We 
then define an accelerated version W d ) of the subsampled process X^ d \ In order to prove a diffusion limit 
for the process X^ d \ one needs to accelerate time by a factor of d ; consequently, in order to prove a diffusion 
limit for the process X ^, one needs to accelerate time by a factor d/P d ' > and thus define V ^ by 

[ td/T (d) J,l* 

The proof then consists of showing that the sequence V^ converges weakly in the Skorohod topology towards 
the limiting diffusion (4.7) and verifying that ||oo,[0,T] converges to zero in probability; this is 
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enough to prove that the sequence converges weakly in the Skorohod topology towards the limiting 
diffusion (4.7). We denote by Jz? the generator of the limiting diffusion (4.7). Similarly, we define 
and the approximate generators of the first coordinate processes xf* and xf 1 : for any smooth and 

compactly supported test function p : R —>■ R, vector x = (aq,..., xf) £ R d and scalar x, w £ R we have 

[ ^p(x,w) =E x , w [p(x[ d l)-p(4 d l)]/S 

\ J?Wp(x,w) = E XtW [p(X$)-p(X$)]/(TW x6) for 5= 1/d 
[ Jf’p(x) = x {i'{x)p'{x) + p"{x)). 

Note that although p is a scalar function, the functions Af^p and AA^p are defined on R d x R. The law 
of iterated conditional expectation yields the important identity between the generators and Jz ?(^, 


^ d) p{y:,w) 


T{d) 


E* 


r p( d ) 




k—0 


(6.3) 


For clarity, the proof of Theorem 4.1 is divided into several steps. 


6.3.1. The finite dimensional marginals ofV d converge to those of the diffusion (4.7) 


Since the limiting process is a scalar diffusion, the set of smooth and compactly supported functions is a core 
for the generator of the limiting diffusion ([EK86], Theorem 2.1, Chapter 8); in the sequel, one can thus work 
with test functions belonging to this core only. Because the processes are started at stationarity, it suffices to 
show ([EK86], Chapter 4, Theorem 8.2, Corollary 8.4) that for any smooth and compactly supported function 
p : R —► R the following limit holds, 


lim E 

d—f oo 


Jz? (d V(Xi ,...,X d ,W)-&ip(X{) 


= 0 . 


(6.4) 


The proof of Equation (6.4) spans the remaining of this section and is based on an asymptotic expansion 
that we now describe. For every x, w £ R we define the approximated generator Ap : R x R —► R by 


Ap(x,w) = (y) {a(w)£'(x) + (\®i 2 + [A(w) - B(w)\ d x S(x,j 1 )) <p (x) 


(6.5) 


where A, B : R —> (0; oo) are two bounded and continuous functions defined by 


i A(w) = E[F'(Q^ + S^)xF(W*-w~S^) 
\ B(w) = E[F(Qa + Sa) x F '(W* -w- S%j 


( 6 . 6 ) 


for W* ~ 7 t w * and F'(u) = e“I u<0 and as defined in (4.1). The functions A, B : R —> R + are 

such that 

E[A(W)] = E[B(W)} = ^a 12 . (6.7) 


The proof of (6.7) can be found in Appendix A.4. It follows from (6.7) that for any fixed x £ R we have 


E [Ap{x,W)\ = AAp(x) 


( 6 . 8 ) 


for a random variable W ~ t^w ■ 

Lemma 6.1. Let Assumptions f hold. We have 


J? {d) p(X 1, ...,X d ,W)- Ap{X u W) 


lim E 

d—¥ oo 


= 0 . 


(6.9) 
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The proof of Lemma (6.1) consists in second order Taylor expansion and an averaging argument; details 
are in Section A.5. For proving Equation (6.4), note that identity (6.3) and Jensen’s inequality yield the 
quantity inside the limit described in Equation (6.4) is less than two times the expectation of 



The expectation of the first term is less than E ..., Xd, W) — Ap(Xi, W) and Lemma (6.1) 

shows that this quantity goes to zero as d —> oo. To finish the proof it thus remains to verify that the 
expectation of the second term also converges to zero; to prove so, note that the second term is less than 
two times 


T (d) 

Z-^k—0 

Av{X%l , Wi d) ) - A<p(X $, Wi d) ) 

2 

f£*=o^ 

[4%wi d) ) 


(<l) 

1 

T(d) 

TA) f 


( 6 . 10 ) 


Under the assumptions of Theorem 4.1, it is straightforward to verify that the function Ap is globally 
Lipschitz in the sense that there exists a constant ||„4<p||Lip such that for every X\,X 2 ,w £ R we have 
\A<f>(x\, w) — Atp(x 2 ,u>)\ < ||*4c^||Lip x |xi — X 2 I; it follows that the expectation of the first term in (6.10) 
converges to zero. For proving that the second term also converges to zero, we make use of the following 
ergodic averaging Lemma whose proof can be found in Section A.6. 

Lemma 6.2. Let h : R —> R be a bounded and measurable test function. We have 


lim E 

d—f 00 


e£o 


(d) 

k 


T(d) 


- E [h(W)] 


= 0 , 


for a random variable W ~ itw independent from any other sources of randomness. 

Identity (6.8), a standard conditioning argument and Lemma 6.2 yield that the expectation of the second 
term in Equation (6.10) also converges to zero; this finishes the proof of the convergence of the finite 
dimensional marginals of V d to those of the limiting diffusion (4.7). 

6.3.2. The sequence V d converges weakly towards the diffusion (4.7) 

The finite dimensional marginals of the sequence process V d converges to those of the diffusion (4.7). To prove 
that the sequence V d actually converges to the diffusion (4.7), it thus suffices to verify that the sequence V d 
is relatively weak compact in the Skorohod topology: since the process V ^ is started at stationarity and 
the space of smooth functions with compact support is an algebra that strongly separates points, ([EK86], 
Chapter 4, Corollary 8.6) states that it suffices to show that for any smooth and compactly supported test 
function ip the sequence d E| ..., Xd, W) | is bounded. Equation (6.4) shows that it suffices to 

verify that E 2z?<^(X) < 00 for X ~ it, which is obvious since p is assumed to be smooth with compact 

support. 


6.3.3. The sequence V d converges weakly towards the diffusion (4.7) 


Because the sequence V d converges weakly to the diffusion (4.7), it suffices to prove that the difference 
\\V d - V d \\ 00i [o,T] goes to zero in probability. To this end, it suffices to prove that the supremum 


sup 


{ 


Y'( < ^) _ v'( £ ^) 

kT( d )+i, 1 kT( d ), 1 


k x T (d) < d x T, i < T (d) j 
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(d) 

kTW,l 


is less than a constant times 


converges to zero in probability. Since ^ — X 

{ \ZkT( d /l | + • ■ ■ + |'Z(k+l)T(‘D-l,l|}) 

standard Gaussian concentration gives the conclusion. This ends the proof of Theorem 4.1. 


7. Practical advice and simulation studies 

Parts 1 and 2 of Theorem 5.1 suggest that our goal of finding values a 2 > 0 and g, > 0 that optimise the 
efficiency is a sensible one. Standard timing diagnostics can provide a straightforward estimate of the relative 
cost, g. Given information on scaling, quality of the stochastic approximation, and acceptance rates from a 
number of preliminary runs, it might also be possible to estimate the accuracy parameters, 0\ and 02 , and 
the roughness constant I. The efficiency functional ((5.2) or (5.5)) could then be optimised and the algorithm 
tuned to achieve the optimal acceptance rates, or the parameters set directly from estimated optimal ft and 
cr 2 . Tuning to particular acceptance rates seems unwise given the sensitivity to the system parameters that 
was observed in Section 5. Moreover, the expression (5.2) for the efficiency functional is the result of an 
idealisation and is unlikely to represent reality precisely. We therefore opt for a more robust approach to 
tuning, based upon the more general features of (5.2). 

We first test the predictions of our theory through a short simulation study on a combination of target and 
approximation where the values of 0\ and 02 can be varied and calculated. We then provide some practical 
advice and verify this and some other predictions from the theory on a simulation study on a real statistical 
example. 


7.1. Simulation study on a Gaussian target using a logistic approximation 


We consider a scenario where the true target is a product of standard Gaussians and the deterministic 
approximation is a product of logistic densities with a mode at ip\ and inverse-scale parameter f> 2 , 


r(z) «exp \ - ^ x 2 


.dL e V2(xi-y>i) 

and E a (x) cx: 1 I - f 

^(1 + e<P*(xi-'Pi)y 


We consider (^r, <p 2 ) e U := {(0,0.6), (0,1.2), (0,1.8), (0,2.3), (0,2.7), (1,1.2), (0.6,1.8), (0.5, 2.3)} and 
examine both the DAPMRWM and the simpler DARWM which, in this section, we further abbreviate to 
DAPM and DA respectively. For the pseudo-marginal tests we impose Assumptions 2 and 3, artificially 
creating W* ~ N (— ^a 2 ,a 2 ) and imagining that the computational cost of obtaining Tr(x*)e w is inversely 
proportional to a 2 . 

We measure the mixing efficiency for each algorithm in terms of the empirical effective samples size (ESS), 
obtained using the initial monotone sequence method of [Gey92]. Since the components are exchangeable 
we record the (harmonic) mean over the d components to reduce Monte Carlo variability. We look at the 
efficiencies of each algorithms relative to the equivalent non-DA algorithm. ESS* denotes the ratio of effective 
sample sizes, and ESS** denotes the empirical equivalents of (5.3) and (5.6), the relative overall efficiencies 
taking CPU time into account, for a particular value of g (see below); here g 2 a± 2 is replaced with the ESS. 

The scalings for the non-DA algorithms were A* = 2.56 /\fd (PMRWM) and A* = 2.38 /\fd (RWM); these 
are the optimal scalings suggested in [STRR15] and [RGG97], respectively. In tests, the efficiency of the 
PMRWM algorithm varied by less than 10% over the range cr 2 £ [1.5, 3]; nonetheless, the optimum efficiency 
occured at at cr 2 « 2 (rather than the value of 3.3 recommended in [STRR15]) and so all relative efficiencies 
for DAPM algorithms are relative to the PMRWM algorithm with a 2 = a 2 := 2. For each (<pi,(p 2 ) € %, 
experiments for each DA algorithm were repeated for each scaling value in {A*, 1.5A*, 2A*, 2.5A*}. For DAPM 
algorithms each of the above combinations was run for each noise variance in {1, 2,3}. All runs used d = 10, 
were started from a point in the main posterior mass and continued for 10 7 iterations. 

Clearly both the target and the approximation are computationally very cheap to evaluate. We artificially 
induce a relative speed by considering three different values of g: 0.01, 0.001 and 0; the last of these provides 
the theoretical optimal relative efficiency for that approximation. 
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Algorithm 

<Pi 

<P2 

Pi 

h 

Qi 

®2|1 

ESS * 

TT'QQ** 

TT'QQ** 

^^O.OOl 

ESSJ* 

RWM 





0.2616 


1 




DA 

0.0 

0.6 

0.834 

0.834 

0.261 

0.128 

0.275 

1.20 

1.25 

1.25 

DA 

0.0 

1.2 

0.441 

0.449 

0.069 

0.533 

0.269 

4.33 

5.06 

5.16 

DA 

0.0 

1.8 

-0.042 

0.262 

0.041 

0.738 

0.213 

5.30 

6.83 

7.06 

DA 

0.0 

2.3 

-0.467 

0.649 

0.034 

0.595 

0.141 

3.99 

5.35 

5.57 

DA 

0.0 

2.7 

-0.810 

1.025 

0.032 

0.492 

0.103 

3.12 

4.28 

4.46 

DA 

1.0 

1.2 

0.535 

0.762 

0.140 

0.151 

0.173 

1.15 

1.23 

1.24 

DA 

0.6 

1.8 

0.056 

0.681 

0.0650 

0.279 

0.142 

1.90 

2.16 

2.19 

DA 

0.5 

2.3 

-0.351 

0.941 

0.049 

0.289 

0.109 

1.83 

2.15 

2.20 

PMRWM 





0.1155 


1 




DAPM 

0.0 

0.6 

0.834 

0.834 

0.2196 

0.066 

0.193 

0.80 

0.87 

0.88 

DAPM 

0.0 

1.2 

0.441 

0.449 

0.0522 

0.230 

0.157 

2.17 

2.89 

3.00 

DAPM 

0.0 

1.8 

-0.042 

0.262 

0.0311 

0.286 

0.123 

2.41 

3.71 

3.95 

DAPM 

0.0 

2.3 

-0.467 

0.649 

0.0252 

0.255 

0.093 

2.06 

3.42 

3.70 

DAPM 

0.0 

2.7 

-0.812 

1.025 

0.0232 

0.223 

0.073 

1.69 

2.90 

3.15 

DAPM 

1.0 

1.2 

0.535 

0.762 

0.1161 

0.069 

0.107 

0.79 

0.91 

0.93 

DAPM 

0.6 

1.8 

0.056 

0.681 

0.0509 

0.129 

0.081 

1.14 

1.53 

1.59 

DAPM 

0.5 

2.3 

-0.351 

0.941 

0.0387 

0.135 

0.067 

1.15 

1.66 

1.74 


Table 1 

Output from the simulation study of a Gaussian target using a logistic approximation when the DA scaling is 2A* and, for 
the DAPM, a 2 = a 2 . ESS* denotes the effective sample size relative to the corresponding non-DA algorithm and 
ESS** ( r) G {0.01,0.001,0}) denotes the relative overall efficiency compared with the corresponding non-DA algorithm. 


Table 1 corresponds to DA runs with a scaling of 2A* and (PM runs) a variance of a 2 ; it provides the 
values of /3, and /3 2 for each value of (ipi, ip 2 ), relative ESSs for each run and the relative overall efficiencies. 
Both for the DAPM and the DA algorithms, the best gain in efficiency is obtained when the location and 
curvature of the approximation closely match that of the target (ipi =0, ip 2 = 1.8); however, provided the 
approximation is sufficiently fast, substantial efficiency gains can also be obtained when the curvature of the 
approximation does not match that of the target. It is also clear that, for similar values of /3 2 and |/3i|, it is 
better for /3\ to be negative rather than positive: an approximation that is more (negatively) curved than 
the target is, apparently, preferable to one that is less curved, as suggested by the theory in Section 5. 

For each run, given the values of /?i and /3 2 we also obtained the Stage One and conditional Stage Two 
acceptance rates predicted by our theory in Section 4, as well as the predicted relative efficiencies when 
= 0 : 

<7 2 [i 2 a 2 \i(HiV 2 ',Pi,(32) n 2 a 2 \i{n,cr‘ 2 = Ojffi,#;) , . 

ES RW m(i^*) 

for the DAPM and DA algorithms respectively, and where the efficiency functions for the PMRWM, Eff pm, 
and the RWM, Eff rwm are defined in Section 5. Here, n is the scaling for the DA or DAPM algorithm and 
<t 2 is the variance of the noise in the log-target for the DAPM algorithm. 

Figure 6 plots the predicted values against the truth. For the acceptance rates our theoretical formulae 
perform well, generally slightly underpredicting the values both at Stage One and Stage Two. In terms of 
relative efficiency the theory performs well for small gains in efficiency but it tends to overestimate the gain in 
efficiency for larger gains; the problematical points for DA and for DAPM all correspond to the larger scalings 
(2A* or 2.5A*). When the scalings are larger each component is further from the limiting diffusion on which 
the theory is based. We have considered a very simple target where, even for the (non-DA) RWM, a relatively 
large scaling can be used. For a more complex and realistic target a (relatively) smaller scaling would be 
needed and so we would expect the theoretical prediction of efficiency to be more accurate; certainly, little, if 
any, over-prediction is observed in the real statistical target in Section 7.3. Even in this simple example, the 
theory does, however, produce roughly the correct ordering of relative efficiencies for a given deterministic 
approximation, which suggests that predictions of optimal parameter values could be trusted even if the 
resulting predicted efficiency may not be. 

7.2. Practical advice 


We first summarise the points that will define our strategy. 
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®2|i (DA) 


o (0.0,0.6) 
A (0.0,1.2) 
+ (0.0,1.8) 
x (0.0,2.3) 
O (0.0,2.7) 
V (1.0,1.2) 

a (0.6,1.8) 
* (0.5,2.3) 


0.1 0.2 0.3 0.4 

predicted 


$ « / 


x (0.0,2.3) 
O (0.0,2.7) 
V (1.0,1.2) 



+ 

+ 

/ + 






* 

,x O O 

.6 

,+ 

0 (0.0,0.6) 


A 

A (0.0,1.2) 
+ (0.0,1.8) 



x (0.0,2.3) 
O (0.0,2.7) 


y 

V (1.0,1.2) 


a (0.6,1.8) 
* (0.5,2.3) 


v. p 




10 12 


a 2 |, (DAPM) 


< (0.0,2.3) 
> (0.0,2.7) 
7 (1.0,1.2) 


/ 


0.1 0.2 0.3 0.4 

predicted 


+ + 1-- 


" * 


A 

° o 

* + AA 


: (0.0,2.3) 
> (0.0,2.7) 
' (1.0,1.2) 


0.2 0.3 

predicted 


x (0.0,2.3) 
O (0.0,2.7) 

v ( 1 . 0 , 1 . 2 ) 


10 12 


Fig 6. Plots of empirical estimates of a±, q 2 |i an d efficiency relative to the corresponding RWM algorithm against their 
theoretical values. The diagonal dashed line is y = x. Each symbol represents a different combination of (^ 1 ,^ 2 )/ for each of 
these combinations, runs were performed with scalings of A, 1.5A, 2A and 2.5A. For the DAPM, for each of these scalings, 
runs were performed with cr 2 G {1, 2, 3}. 
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1. For a 2 > 1 the optimal value of p is relatively insensitive to a 2 (Figure 3). 

2. At no point where Eff re i > 3 was a 2 found to be outside of the interval [1.4,3.3] (end of Section 5). 

3. These findings are subject to the assumptions that a 2 is independent of x, and are true in the limit as 
d —> oo. Any optimisation method should allow for these only being approximations to the truth. 

The key observation is Point 1; this allows us to transform the problem from a two-dimensional optimi¬ 
sation into two one-dimensional optimisations. 

To improve the efficiency of an RWM algorithm it is usually advisable to make the jump proposal matrix 
reflect the overall shape of the posterior (e.g. [RR01]). One frequently used strategy for the RWM (e.g. 
[SFR10] is to set the proposal covariance matrix to be proportional to an estimate of the target covariance 
matrix from a preliminary run. 

We therefore assume that the practitioner has performed a preliminary run and, in addition to obtaining 
an approximate covariance matrix, V := Var(X), some representative value, such as an approximate 
posterior mode, median or mean, has also been ascertained. A small number of further values from the 
approximate posterior, x^ 1 ),..., x.( k \ should also be noted. Following the preliminary run, our strategy is: 

1. By running the DAPsMRWM algorithm with the scaling set to zero, A = 0, and recording the estimated 
log-likelihood, find the number of particles, m* , such that for all x^°\ ..., (ideally) a 2 £ [1.4, 3.3] 
or (at least) a 2 > 1 

2. Perform several runs of the DAPsMRWM algorithm with m* particles, and find the scaling, A, that 
optimises the efficiency. 

3. Perform several runs of the DAPsMRWM algorithm with scaling set to A and find the number of 
particles, m, that optimises the efficiency. 


7.3. Simulation study: application to Markov jump processes 


To illustrate the theory, we consider a Lotka-Volterra predator-prey model (e.g. [BWK08]). The model 
describes the continuous time evolution of Uf = t/ 2j t) where U\ tt (prey) and £/ 2j t (predator) are non¬ 

negative integer-values processes. Starting from an initial value, which is assumed known for simplicity, IR 
evolves according to a Markov jump process (MJP) parameterised by rate constants c = (ci,c 2 ,C 3 ) and 
characterised by transitions over (t, t + dt ] of the form 


P {Ul^t+dt = Ul,t + 1, U2,t+dt = U2,t\ u l,t, U 2 ,t) 

P {U\,t+dt = Ul,t — 1, U2,t+dt — U2,t + l|wi,i, «2,t) 
P {Uij+dt = wi,t, CR,t+dt = U2,t ~ l|rti,t> U2,t) 


C\U\jdt + o(dt), 

C2Ul, t U2,tdt + o{dt), 

c 3 u 2 , t dt + o(dt). 


The process is easily simulated via the Gillespie algorithm [Gil77] and the PMRWM scheme is straightforward 
to apply (see [GW11] for a detailed description). 

We assume that the MJP is observed with Gaussian error every time unit for n time units, t = 1,..., n: 


Y t 


v 


N 


Ul,t 

U 2 ,t 



As all of the parameters of interest must be strictly positive, we consider inference for 


x = (log(ci),log(c 2 ),log(c 3 ),log(si),log(s 2 )). 

The DAPsMRWM scheme requires that a computationally cheap approximation of the MJP is available. We 
follow [GHS14] by constructing a linear noise approximation (LNA) (see e.g. [vKOl]). Under the LNA 

U t ~ N (z t + m f , V t ) 


where z t , m t and V f satisfy a coupled ODE system 


i t = S h(z t , c) 

rh( = F t rip 

Vt = V t F? + Sdiag{h(z t ,c)}S T + FtVt 


(7.2) 
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m 

80 

100 

150 

200 

250 

300 

500 

800 

2000 

7 

a 2 

8.30 

5.86 

3.53 

2.52 

1.83 

1.50 

0.89 

0.52 

0.20 

1 

mESS/s 

0.0750 

0.0808 

0.0810 

0.108 

0.118 

0.119 

0.119 

0.113 

0.0661 


Oil 

0.256 

0.255 

0.257 

0.255 

0.257 

0.254 

0.254 

0.255 

0.258 


“2|1 

0.0651 

0.0883 

0.170 

0.237 

0.289 

0.341 

0.447 

0.547 

0.692 

2 

mESS/s 

0.140 

0.218 

0.296 

0.289 

0.319 

0.278 

0.262 

0.181 

0.127 


Oil 

0.0556 

0.0514 

0.0489 

0.0503 

0.0517 

0.0520 

0.0513 

0.0517 

0.0505 


«2|1 

0.0619 

0.0895 

0.163 

0.213 

0.286 

0.313 

0.438 

0.522 

0.674 

2.5 

mESS/s 

0.142 

0.226 

0.338 

0.381 

0.325 

0.318 

0.330 

0.282 

0.142 


Oil 

0.0244 

0.0237 

0.0234 

0.0259 

0.0264 

0.0234 

0.0241 

0.0230 

0.0250 


“2|1 

0.0600 

0.0815 

0.159 

0.218 

0.252 

0.312 

0.434 

0.523 

0.675 

3 

mESS/s 

0.160 

0.294 

0.364 

0.441 

0.401 

0.419 

0.364 

0.277 

0.156 


Oil 

0.0143 

0.0123 

0.0119 

0.0114 

0.0131 

0.0120 

0.0114 

0.0124 

0.0121 


“2|1 

0.0416 

0.101 

0.152 

0.233 

0.274 

0.320 

0.426 

0.516 

0.673 

3.5 

mESS/s 

0.107 

0.225 

0.331 

0.402 

0.374 

0.390 

0.348 

0.307 

0.162 


Oil 

0.00629 

0.00789 

0.00763 

0.00684 

0.00669 

0.00663 

0.00725 

0.00634 

0.00694 


“211 

0.0550 

0.0869 

0.170 

0.237 

0.273 

0.312 

0.424 

0.534 

0.673 

4 

mESS/s 

0.107 

0.174 

0.176 

0.291 

0.308 

0.319 

0.351 

0.292 

0.162 


Oil 

0.00343 

0.00318 

0.00401 

0.00388 

0.00372 

0.00357 

0.00377 

0.00402 

0.00418 


“2|1 

0.0680 

0.105 

0.151 

0.215 

0.287 

0.310 

0.407 

0.500 

0.681 

4.5 

mESS/s 

0.0728 

0.159 

0.150 

0.267 

0.310 

0.300 

0.300 

0.258 

0.153 


Oil 

0.00220 

0.00183 

0.00207 

0.00247 

0.00230 

0.00256 

0.00224 

0.00249 

0.00226 


«2|1 

0.0527 

0.111 

0.143 

0.213 

0.265 

0.280 

0.424 

0.491 

0.658 


Table 2 

Minimum effective sample size (mESS) per second, stage 1 acceptance probability 6 l\ and stage 2 acceptance probability a 2 \i 
as functions of the number of particles m and scaling 7. The variance (a 2 ) of the estimated log-posterior at the median is 

also shown for each choice of m. 


For the Lotka-Volterra model, the rate vector h(z^,c), stoichiometry matrix S and Jacobian matrix Ft are 
given by 

h(z t ,c) = (CiZi,t,C2Z ltt Z2,t,C 3 Z2,t), 


S = 


1 

0 


-1 

1 



F t 


Ci - C 2 Z 2 ,t 
C 2 Z 2 ,t 


~C 2 Zi : t \ 
C2Zl, t - C 3 ) 


Appendix B describes an algorithm for evaluating the posterior (up to proportionality) under the LNA. For 
further details regarding the LNA and its use as an approximation to a MJP, we refer the reader to [FGS14] 
and [GHS14]. 

Data were simulated using an initial value Uo = (71, 79) for n = 50 time units with c = (1.0,0.005, 0.6) and 
Si = s 2 = 8. These parameters were assumed to be independent a priori with independent proper Uniform 
densities on the interval [—8,8] ascribed to A}, (i = 1,... ,5). An initial run of the DAPsMRWM scheme 
provided an estimate of a central value x (the posterior median) and the posterior variance matrix Var(X). 
For a PMRWM scheme [STRR15] suggest that the scaling should be V prop = x Var(X) to optimise 

efficiency. We therefore applied the DAPsMRWM scheme with scaling V prop = y 2 x ^ 2 "]( 6 ^ x Var(X). To find 
the optimal scaling 7 and number of particles ro, we followed the 3-step strategy of Section 7.2. Specifically: 


1. The approximate posterior median and 4 additional samples were recorded, based on the initial run. 
Running the DAPsMRWM algorithm for all sampled parameter values with 7 = A = 0 and m = 200 
particles gave a 2 € [2.04,3.56]. We therefore took to* = 200. 

2. Further short runs (of 5 x 10 4 iterations) with to* = 200 particles and 
7 € {1,2,2.5,3, 3.5,4,4.5} gave 7 « 3. We therefore took 7 = 3. 

3. Additional short runs of the DAPsMRWM algorithm with 7 = 3 and to G {100,150, 200, 250, 300} gave 
to = 200 . 

To confirm that the practical advice is reasonable and to test some of the predictions of our theory, the 
number of particles to was varied between 80 and 2000 and, for each to, the scaling 7 was varied between 
1 and 4.5. For each (m, 7) pair, a long MCMC run (of at least 4 x 10 5 iterations) was performed. Table 2 
shows empirical efficiency, measured in terms of minimum (over each parameter chain) effective sample size 
per second, as well as Stage 1 and Stage 2 acceptance rates. Figure 7 shows empirical efficiency as a function 
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Fig 7. Empirical efficiency measured as the effective sample size per CPU second. The left-hand panel gives the efficiency 
plotted against 7 for various numbers of particles. The right-hand panel gives the efficiency plotted against cr (estimated at the 
posterior median), for various scalings. 


of the scaling 7 (with a varying number of particles m) and as a function of the number of particles (for 
various scalings 7 ). Table 2 suggests that for the values of 7 and m considered, 7 = 3 and to = 200 are 
optimal in terms of minimum ESS per second. Proposition 4.2 proves that, subject to assumptions, the 
Stage 2 acceptance probability decreases as the variance in the log-posterior (cr 2 ) increases and the Stage 
1 acceptance probability decreases as the scaling increases; the fact that these patterns are observed in our 
experiments (see Table 2) provides empirical evidence that our assumptions are reasonable. Further empirical 
evidence is provided by the insensitivity of the optimal choice of scaling, 7 , to the value of cr 2 , for values 
of cr 2 >= 0.89, as suggested by Figure 7. For cr 2 < 0.89 the optimal scaling increases, as predicted by the 
theory. 

Finally, we compare the performance of the DAPsMRWM scheme against a PMRWM scheme. Following 
the practical advice of [STRR15], we ran the PMRWM scheme with m = 200 particles for 2 x 10 5 iterations 
with a scaling tuned to give an acceptance rate of around 10% (which required 7 = 0.9. This gave a minimum 
ESS per second of 0.0537. The optimally tuned DAPsMRWM (m = 200 and 7 = 3; overall acceptance rate 
0.27%) gave a minimum ESS per second of 0.441 and a naive implementation of DAPsMRWM (to = 200 
and 7 = 0.9; overall acceptance rate 6.95%) gave a minimum ESS per second of 0.0982. For this application 
therefore, overall efficiency of DAPsMRWM (optimal) : DAPsMRWM (naive) : PMRWM scales as 8.2 : 1.8 
: 1. In this example, rj ss 0.0014. The right-hand plot of Figure 4 suggests that in this case the theoretical 
optimal efficiency (for a range of small but non-zero values of /? 1 and /? 2 ) is less than 10. Hence, unlike the 
toy example in Section 7.1, in this case the theory does not appear to exaggerate the achievable gain in 
efficiency. 


8 . Discussion 

We have provided a theoretical analysis of the delayed-acceptance pseudo-marginal random walk Metropolis 
algorithm (DAPsMRWM) in the limit as the dimensions, d , of the parameter space tends to infinity. Our 
analysis also applies to the delayed-acceptance random walk Metropolis (DARWM). 
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As with many other analyses (e.g. [RGG97, RR98]) we assume that the target has an iid product form. 
We then follow [STRR15] and [DPDK15] in assuming that the noise in the unbiased estimate of the posterior 
is additive on the logarithmic scale, with a distribution which is independent of the current position. We 
also assume that a cheap deterministic approximation is available for each component of the product, and 
that the error in each such approximation is a realisation of a random function. Individual realisations of 
the error are subject to only minor regularity conditions. As such, the error model is reasonably general and 
should capture the main characteristics of many real (deterministic) approximations. This is verified for a 
toy Bayesian inverse problem. 

We examine the above model as dimension d —t oo and we obtain limiting forms for the Stage One 
and the conditional Stage Two acceptance rates and the expected squared jump distance. We also obtain 
a diffusion approximation for the first component of the target, which justifies the use of expected squared 
jump distance as a measure of efficiency. 

Subject to the assumption of the Standard Asymptotic Regime, that was introduced in Section 2.2, we 
obtain simplified forms for the acceptance rates and for the efficiency in terms of both the mixing of the 
Markov chain and of the computational time. 

We find that as scale parameter of the random walk proposal, p, increases the Stage One acceptance rate 
decreases and as the variance of the logarithm of the stochastic estimator, er 2 , increases the conditional Stage 
Two acceptance rate decreases; thus efficiency can also be thought of as a function of these acceptance rates. 
We also show that the efficiency function has at least one local maximum, and computational investigations 
of the efficiency function all showed exactly one local maximum, indicating that the goal of optimising the 
efficiency is sensible. 

The theoretical work suggests that the DARWM can be worth implementing provided the deterministic 
approximation to the target is at least 10 times quicker to compute than the target itself, and the DAPsM- 
RWM can be worth implementing when the deterministic approximation is at least 30 times quicker to 
compute than the unbiased estimator with a 2 = 1. The theoretical work also supports the key intuition 
that, provided the cheap deterministic approximation is fast and reasonably accurate, the DAPsMRWM and 
DARWM algorithms should be optimally efficient when /i is much larger than (and the overall acceptance 
rate is much lower than) that of the equivalent PsMRWM or RWM algorithm. By contrast, for optimal 
performance with the DAPsRWM a 2 should, typically, be only a little smaller than the optimal value for the 
PsRWM. Furthermore the theoretical work leads to practical advice on tuning a DAPsMRWM algorithm, 
the key element of which is that, except for small values of a 2 , the optimal p is almost independent of a 2 , 
and hence, of the number of particles used; small values of a 2 , are associated larger optimal scalings. Thus, 
with care, the two-dimensional optimisation (over RWM scaling and the number of particles) can be reduced 
to two one-dimensional optimisations. 

A short simulation study on a toy example in d = 10 shows that the theory predicts the Stage One and 
Conditional Stage Two acceptance rates well; the efficiency relative to the equivalent non-DA algorithm is 
estimated relatively well for small efficiency gains but the theory overpredicts larger gains, because, with 
a larger scaling the process is further from the limiting diffusion. Nonetheless the theory reproduces the 
ordering of the empirical efficiencies fairly well, lending additional credence to theory-based tuning advice. 
Moreover, on a real example where a discretely observed Markov jump process is approximated through the 
Linear Noise Approximation, the theoretical efficiency predictions appear to be more accurate. In this real 
statistical example the optimal p is insensitive to a 2 , except for small values of a 2 , where it increases, as 
predicted by our theory, and supporting our recommended tuning strategy. 


Appendix A: Proof of technical results 

In this section we denote by <h(;r) = p[u ) du the cumulative Gaussian function with Lp{u) = e _ “ 2 / 2 /-\/27r. 

The bound 1 — <h(;r) < ip(x)/x for x > 0 is used in several places. 

A. 1. Proof of Proposition 4 ■ 2 

The only not entirely trivial parts of this proposition involve establishing that aq and a 2 |i are decreasing in 
p and a respectively. For proving that aq = g(— yp(l — Pi), p 2 (l + /3f — 2/3i)) is decreasing as a function 
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of p when fa < 1, note that since \fa\ < fa, 1 + j3 2 — 2fa > (1 — fa) 2 ] hence it suffices to show that for any 
positive constant c > 0 the function h : ^4 G(— p 2 ,c 2 p 2 ) is decreasing. Since h(p) = E [F(—fa 2 + cpfa] for 

a random variable £ ~ N (0,1) and F'(x) = e x I(x < 0) it follows that 

h'(p) = / F'{—fa 1 + cpz) (— 2/i + cz) ip(z) dz = / F'{—fa 2 + cpz) (— 2p + cz) ip{z) dz. 

J 2GM J z<.p /c 

This quantity is negative since —2p + cz < 0 on the event {z : z < p/c}. Proving that a 2 |i is decreasing 
as a function of a readily follows from the fact that for any fixed a £ K. the derivative of the function 
a H > G(—o 2 + a, 2a 2 ) < —\[2ip( —aj\pl + a/(<r\/2)) < 0 and differentiation under the integral sign. 


A. 2. Proof of Proposition 5.1 


Given the assumption in the proposition, the constraint (3.6) and since fa is positive 

ES(p,a-fa,fa,ri) = ^ - h 2,j2a 2 \i(hfa',fa,fa) < fav 2 a 2 \ifa,cr; fa, fa) 


Now define 


pa 2 + aq(/r; fa, fa) 


yl{p,,a;fafa) = 1 A exp ( --(1 - fa)p 2 + p(fa - 1)£ 


a* 2 (p,o;fafa,Z) = lAexp(-o 2 -^fap 2 -p,fa£ + V2oZ S J, 
and notice that aj is an increasing function of £ and a 2 is a decreasing function of £. Hence 


a 12 (p,a;fa,fa) = E i:Z [al(p,a;fafa)a2(p,a;fa,faZ)) 

< % [&* {p, a-, fa , 0] E^ z [a *2 (p, a; fa , £, Z)\ 

= a 1 (p;fa,fa)a PM {pV~fa,<j), 


where apM(p' ,c 2 ) is the asymptotic acceptance rate for the PMRWM algorithm with Gaussian noise 
in the log-target from Corollary 1 of [STRR15]. The corresponding efficiency function is Eff pm(p',&) = 
(p') 2 o 2 apM(fa, o' 2 ), and so 


Eff (Ah fa, fa, p) < -j-Eff p M {H'ffa,a) < 
fa 


— sup Eff PM (At,cr), 
Pi fJ,,a 2 


as required. 


A.3. Proof of Theorem 5.1 

Since Eff (p,a 2 ) = , for a fixed value of scaling p > 0 the efficiency functional goes to zero as 

a —► 0 and a —► oo. Similarly, the fact that the efficiency goes to zero as p —> 0 for any fixed value of a > 0 
is straightforward; it remains to verify that the efficiency also converge to zero as p —» oo. It suffices to show 
that p 2 ai 2 (p, a) -A 0; since for any x,y GlS. we have min (1, e x ) min (1, e v ) < min (1, e x+v ) we have 

aia < E [F (Q% + W A )] = 2 $ |-J 

and the conclusion readily follows. To finish off the proof of Proposition 4.2, it remains to consider potential 
stationary points of the efficiency functional Eff(/x, er 2 ). Taking the first derivative in the definition (5.2) 
yields that if a > 0 is a stationary point then 

2aiai2 

afaa 12 =--. 

pa z + a.\ 
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Plugging this estimate in the expression of the second derivative of Eff (/u, er 2 ) then shows that if o is a 
stationary point of the efficiency functional we have 

/ d a<T E& = —/ -(3 d a ai 2 + cr <9 CTCT ai 2 ). 

r)a z + a\ 

Differentiation of Equation (4.5) and straightforward algebraic manipulations then yield that the second 
derivative of the efficiency functional at a stationary point a > 0 also reads 


r]a 2 + a 1 1 \/2 

2 f ^crcrEff — X IE 

a [l Z J 2 


F(QZ + SZ)x<p{--^ 



This shows that 9 CTCr Eff is strictly negative if a 2 < 6. Consequently, on the interval cr 2 £ (0,6) the efficiency 
functional has no local minimum and has at most one local maximum. 


A.4- Proof of Equation (6.7) 


Equation (3.7) yields that R = W* — W for (W*,W) ~ ttw* < 8 > 7 r w has a density n R such that the function 
r i—^ e r / 2 n R (r) is symmetric i.e. e’’/ 2 t r R (r) = e ~ r / 2 i T R (—r). Similarly, algebra reveals that the joint Gaussian 
density ttq,s(q, s) of the pair {Q^ , ) described in Lemma 4.1 is such that 

e q/2 t T Q ,s(q, s ) = e~ q/2 7r Q; s(-g, -s). 

That is because — log 7 ^ 5 ( 9 , s) = aq 2 + bs 2 + cqs — q/2 + (constant) for some coefficients a, b, c £ R. 
Consequently, since the accept reject function F is such that e~ u F{u ) = F(—u) for any u £ R, the function 

g(q, r, s) = e (9+s)/2 F(r - s ) 7 r Q , s (q, s) n R (r) 

= e~ {r ~ s)/2 F{r - s) (e q/2 ttq,s(q, s)) (e r/2 tt R (r)^j 

is such that g(q , r, s) = g{—q, —r, — s). It follows that 


= E 


E [A(W)] = E [F'(Q°£ + S%)xF(R-S%)} = F\q + s ) F(r - s ) tt QiS (q, s ) tt R (r) dq dr ds 

J J ii 3 

e-( i+ s )/ 2 f'( q + s ) g(q, r, s) dqdrds = /// e^ q+a ^ 2 F'(— [q + s]) g(q, r, s ) dqdrds 
e« a+ sr F'{-[Q% + S%])F{R - S%) 

Consequently, since F'(u) + e“ F'(—u ) = F(u) for u £ R, it follows that 

2 x E [A(W)\ = E \f\Q% + S%) x F(R - S%) + e Q ± +s ~ F'(-[Qa + S%])F(R - S%) 

= E [F(Q£+SZ)) = a 12 . 


The proof that E \B(W)\ = 0 : 12/2 is similar and thus omitted. 


A. 5. Proof of Lemma 6.1 

In this section we need to consider asymptotic expansions of the type E X)UJ [...] = ^(x, w) + (error term), 
where (x,w) £ R d x R and (error term) = £d(x,w) for a function Ed : R d x R —> R. We use the notation 
(error term) = 0 ^ 2 ( 1 ) to indicates that, under the equilibrium distribution, the moment of order two of the 

error term is asymptotically negligible, E [e^X^, IE) 2 ] —> 0 as d —» 00 for (X( d \ W) ~ tt^ (g) 7 tw- Since ip 
is smooth with compact support, a second order Taylor expansion reveals that 

Jz?^V( X , w) = (drift term) ip' (x) + (1/2) (volatility term) p (x) + 0 ^ 2 ( 1 ) 


(A.l) 
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where the drift and volatility terms are given by the following conditional expectations, 

(drift term) = {1/S) x E x ,„, - xA a$ (x.M.XM-’.fM’j 


(volatility term) = (1/5) x E X) „ 


(x^’* - n) a $ (x, w, XW^’*) 


(A.2) 


,Z d ) 


(A.3) 


with Xj^’* = x + {ft/I) 5 1 / 2 Z^ and standard centred Gaussian random variable Z A) = {Z\,. 

• It readily follows from Lemma 4.1 that for 7r-almost every x we have 

(volatility term) = a 12 x {ft/ 1) 2 = J{ft) + o L 2 ( 1 ). 

• For the drift term, we make use of the following integration-by-part formula, also known as Stein’s 
identity, 

E[2x g{Z)\ = E [g'{Z)] for Z ~ N (0,1), (A.4) 

which holds for any continuous and piecewise continuously differentiable function j : IR -> K such 
that x H > max {g{x), g'{x)) is polynomially bounded. In what follows, F'{u ) = e“I u <o. The expression 
(3.12) for 0^2 (x, w, X^’*, W^’*), identity (A.4) and standard algebraic manipulations yield that 

(drift term) = 5 -1 / 2 {ft/I) E XjU) [Zi a^(x, X^’*, W^’*)] 

= {ft/I ) 2 E x ,4 F'{Q ( f> + s {d) ) F{\N^ - s (d) ) {e'(x[f*) + d x S{x[f*,'n)}} 

- {ft/ 1) 2 E^ W [F{Q^ + S { £>) F'(wi d) - S^) d x S{x[f*,'y 1 )] 

= {ft/I ) 2 A{w)£'{x 1) + {ft/ 1) 2 [A{w) - B{w)] d x S(x lt 71) + 0^2(1), 

where the functions A,B : K —> R + are defined in Equation (6.6) and the quantities Q^,S^ and 
in Equation (3.11). 

Plugging (A.5) and (A.3) into (A.l) shows that the limit 

&W<p{XW,W) - Atp(x[ d) ,W )' 2 

holds for (X< d ),IF) ~ ttW <S) nw, as required. 


(A.5) 


lim E 

d—f 00 


= 0 


A.6. Proof of Lemma 6.2 


The strategy of the proof is as follows. We define three stochastic processes jw^.j 


fc >0 






k> 0 


{W m , k } k>0 such that 


lim^oo P ( Wi d l = W, (d) 


= w k d) 


— rv k 

0 < k < 


lim^oc 


: 0 < k < T^ = 1 , 

tM) ‘= (w£l = wjf 
*-»■,* ■ 0 <k<TW) = 1 , 

k>Q is a Markov chain that is ergodic with respect to ttw- 


«! = w m<k 


0<k< T( d A 


(A. 6 ) 


Once (A. 6 ) is proved, Lemma 6.2 immediately follows. Let us now defines these three processes and verify 
that Equation (A. 6 ) holds. To do so, let us consider i.i.d sequences {Xj}j> 1 and {W/}i>i and {Zi tk }i,k>i 
and {U k }k>o respectively marginally distributed as n and 7 r^. and N (0,1) and Uniform(0,1). We consider 

{xi}i>i a realisation of {A/}j> 1 and for any index d > 1 we set Xg C ^ = {x \,..., xf) and ~ ir\y 

and recursively define = (x^’*, , with Xj/ 0 '* = X^ + {ft/I) 5 1/2 Zj, d) and Z ^ = 

{Z\ ,fc,..., Zd,k)> if 


Uk < F (Qi d) fe + S% k ) x F ( W* k - W, 


(d) _ c,(d) 

fe J A,fc 


(A.7) 
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and = otherwise. In the above 


Q { S k =Eti^(45 

' D,* . ' 

i ’ 

\\ is a DAPsMRWM Markov chain that targets 

/ J k" " 


S { / k = -s(xj e % 7i ) ■ 


[x[ d) ,wl d) 


k> 0 


Indeed, for any index d > 1 the process j ^ 

g) ttw • Let us now define the processes W^,W^,Wu- 
• We set W/l = Wf' 1 and recursively define w/\ +1 = W k if 

Uk < F (QW Aik + S%/ k ) x F ( W, : W™ - sf A> 

and W A \ +l = Wjjf k otherwise; we have used the notations 


(A.8) 


Q, 


id) 

+ ,A ,fc 




= (1*6/1) Ell V(xi)Zi, k + (^/2P) Eti *"(*0 


J *,A,fc = (/^/-O Ei=i<S'(xi,'Yi)Zi,k + (/i 2 <5 2 /2/ 2 ) E*=i«5"(a:i,7i). 
Similarly, we set W^q = Wq^ and recursively define if 

Uk < F (QW A>k + sW Aik ) x f (w fe * - W<J - 


(A.9) 


and W^ +1 = W A \ otherwise; we have used the notations \Q ^ A k , S/ A fc j to designate a Gaus¬ 
sian random variable in R 2 , independent from any other source of randomness, with same law as 

)( d ) o(d) 


(q. 


A, a, k> ^+,A,fc 


• Finally, we set W^ 0 = w/ and recursively define W^ +1 = W k if 


Uk < F (q<$ + si°S) x F ( W,: - W^l - 4°J) 


(A.10) 


and IFp d ^ +1 = W^. otherwise; in the above j (fi/l■, «)L 0 is an i.i.d sequence marginally 
distributed as [Q^\ \ see Lemma 4.1. 

It is obvious that | W/i } and | | have the same law. The fact that |VF^| is a Markov 

chain ergodic with respect to ttw readily follows from the fact that it is reversible with respect to *kw] it is a 
standard Gaussian computation. The proof of the first and third equation in (A.6) is based on the following 
basic remark. For convenience, let us denote by fik.%fiiflc the Bernoulli random variables indicating 

whether or not the respective events (A.7), (A. 8), (A. 9), (A.10) are realised or not. We have 

T( d ) 


l 


(wJJ = W/ : 0 < k < T^ P (4 d) ^ £ t 


id) 

A 


and the conditional probability P ( £/ ^ £ k d A 


{■ 


the event \ wff k = W k 


id) 


}• 


k =0 


w { /l = w/ 


Ki = ^ 


id)' 


(A.ll) 


is less than the expectation, conditioned upon 


of the absolute difference 


F (< Q { Sk + 4 d i) F (w* k - wf s/ k ) - F (q^ + s%/ k ) F ( 


w* k - w£l - s. 


id) 

+ ,A ,fc 


(A.12) 
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Because the [0, l]-valued function F is assumed to be Lipschitz, if w/\. = W / the absolute difference 

in (A.12) is less than 2 x ||F|| L ip x {|(3^ fc “ Q^.A,k 
derivatives of the log-likelihood function £ are globally bounded, a third order Taylor expansion yield that 


+ 


n( d ) Q 
D A,fc D - 


(d) 

'+,A ,k 


a Because the second and third 


E 


/o(^) _/n(^) 

^A,k ^Jf.,A,k 


< d~ 1/2 E 


E(>(45 )-*&)) Zi,k +dr 1 E z 2 

i =1 i— 1 


+ 0(d -1/2 ) 


< 


rf_1/2 E E (44?)-^) 

I t=i L 
= 0{k d _1/2 ). 


1/2 


TcT 1 i 


1/2 


0(d~ 1/2 ) 


We have used the fact that for any exponent p > 1 we have E 

follows from the triangular inequality. Similarly, we have that E 
estimates in (A. 11) shows that 


X; 


(d) 


s 


(d) 
A ,k 


1 1/P 


-s 


id) 

+ ,A,fc 


< kd 1 / 2 , which readily 
< kdr 1 ! 2 . Plugging these 


1 - P (43 = : 0 < k < T< d~ 1/2 E k -> ° 

k—0 

since T^ = d 1 for some exponent 7 € (0,1/4); we have thus proved that P = w/ : 0 < k < 

converges to one as d — > 00 . The proof of the estimate P (w^\ = \ : 0 < k < —> 1 uses the same 

ingredients and is thus omitted. 


Appendix B: Marginal likelihood under the linear noise approximation 

For simplicity of exposition we assume an observation regime of the form Y t = Ut + £t with e t ~ N (0, S) 
where e t is a length-d x Gaussian random vector. Suppose that Ui is fixed at some value Ui. The marginal 
likelihood (and hence the posterior up to proportionality) under the LNA, 7r a (yi :ra |x) can be obtained as 
follows. 

1. Initialisation. Compute 7r 0 (yi|x) = ip (yi; Ui, X) where ip (yi; up , S) denotes the Gaussian density 
with mean vector Ui and variance matrix X. Set ai = Ui and C to be the d x x d x matrix of zeros. 

2. For times t = 1,2,..., n — 1, 

(a) Prior at t + 1. Initialise the LNA with z t = a 4 , m t = 0 and V t = Ct . Note that m s = 0 for 
all s > t. Integrate the ODE system (7.2) forward to t + 1 to obtain z t +i and V t+ i. Hence 
Xt+i|yi :t ~ N (z t+ i,V t+ i) . 

(b) One-step forecast. Using the observation equation, we have that Y t +i|yi :t ~ N (z t+ i, V t+ i + X). 
Compute 7r Q (yi :t+1 |x) = 7r a (yi :t |x) ip (y t+1 ; z t+ i , V t+ i + X). 

(c) Posterior at t+ 1. Combining the distributions in (a) and (b) gives U t+ i |yi : t+i ~ N (a t+1 ,C t+1 ) 
where a t+ i = z t+ i+V t+ i(V t+ i + X) (yt+i — Zt+i) and C*+i = V, +1 -V, +1 ((V, +1 + X)- 1 V t+1 
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